Abstract

Weighted logic programming, a generalization of bottom-up logic programming, is a well- 
suited framework for specifying dynamic programming algorithms. In this setting, proofs 
correspond to the algorithm's output space, such as a path through a graph or a gram- 
matical derivation, and are given a real-valued score (often interpreted as a probability) 
that depends on the real weights of the base axioms used in the proof. The desired out- 
put is a function over all possible proofs, such as a sum of scores or an optimal score. 
We describe the PRODUCT transformation, which can merge two weighted logic programs 
into a new one. The resulting program optimizes a product of proof scores from the origi- 
nal programs, constituting a scoring function known in machine learning as a "product of 
experts." Through the addition of intuitive constraining side conditions, we show that sev- 
eral important dynamic programming algorithms can be derived by applying PRODUCT to 
weighted logic programs corresponding to simpler weighted logic programs. In addition, 
we show how the computation of Kullback-Leibler divergence, an information-theoretic
measure, can be interpreted using PRODUCT. 

KEYWORDS: weighted logic programming, program transformations, natural language 
processing 



1 Introduction 

Weighted logic programming is a technique that can be used to declaratively specify 
dynamic programming algorithms in a number of fields such as natural language
processing (Manning and Schütze 1999) and computational biology (Durbin et al.
1998). Weighted logic programming is a generalization of bottom-up logic programming
where each proof is assigned a score (or weight) that is a function of the scores
of the axioms used in the proof. When these scores are interpreted as probabilities, 
then the solution to a whole weighted logic program can be interpreted in terms of 
probabilistic reasoning about unknowns, implying that the weighted logic program 
implements probabilistic inference.¹

* To appear in Theory and Practice of Logic Programming (TPLP).

¹ The word inference has a distinct meaning in logic programming (e.g., "inference rule," "valid
inference"), and so we will attempt to avoid confusion by using the probabilistic modifier whenever
we are talking about probabilistic reasoning about unknowns.

Even though weighted logic programming is not limited to probabilistic inference,
it is worth detailing their relationship. Let I, A, and P be random variables, where
the values of I and A are known and the value of P is not known. Often there is a
correspondence where

• I corresponds to a conditional "input," encoded as axioms, known to be true;

• A corresponds to a set of axioms known to be true; and 

• P corresponds to a deductive proof of the goal theorem using the axioms. 

In the setting of weighted logic programming, there may be many different proofs of 
the goal given the set of axioms. We must therefore distinguish the weighted logic 
program from the "world" we are reasoning about in which these many different 
proofs of the goal correspond to different, mutually exclusive events, each of which 
has some probability of occurring. Weighted logic programming implements probabilistic
inference over the value of the proof random variable P given the values of
A and I: the weighted logic program implies a probability distribution p(P | A, I),
and it can be used to compute different useful quantities related to the distribution. 

Previous work on weighted logic programming has shown that certain families of 
probabilistic models lend themselves extremely well to weighted logic programming 
as an inference mechanism. In general, weighted logic programming deals with 
probability distributions over objects with combinatorial structure — paths through 
graphs, grammatical derivations, and sequence alignments — that are quite useful 
in computer science applications. 

In principle, one can think about combining such distributions with each other, 
creating distributions over even more complex structures that are related. This 
paper is about a natural extension to weighted logic programming as probabilistic 
inference over structures: combining weighted logic programs to perform inference 
over two or more structures. We describe a program transformation, PRODUCT, that 
implements joint probabilistic inference via weighted logic programming over two 
structured variables P_1 and P_2, when (a) each of the two separate structures can be
independently reasoned about using weighted logic programming, and (b) the joint
model factors into a product of two distributions p(P_1 | A_1, I_1) and p(P_2 | A_2, I_2).²

As a program transformation on traditional logic programs, PRODUCT is not novel;
it has existed as a compiler transformation for over a decade (Pettorossi and
Proietti 1994; Pettorossi 1999). As a way of describing joint probabilistic inference in
weighted logic programming, the transformation has been intuitively exploited in
designing algorithms for specific applications, but has not, to our knowledge, been 
generalized. The contribution of this paper is a general, intuitive, formal setting for 
dynamic programming algorithms that process two or more conceptually distinct 
objects. Indeed, we show that many important dynamic programming algorithms 
can be derived using simpler "factor" programs and the PRODUCT transformation 
together with side conditions that capture the relationship between the structures. 
The paper is organized as follows. In §2 we give an overview of weighted logic



² In the language of probability, this means that P_1 and P_2 are conditionally independent given
A_1, A_2, I_1, and I_2.

reachable(Q) :- initial(Q). (1)
reachable(Q) :- reachable(P), edge(P, Q). (2) 

Fig. 1. A simple bottom-up logic program for graph reachability. 



initial(a) = T    edge(c, d) = T
edge(a, c) = T    edge(d, b) = T
edge(a, d) = T    edge(d, c) = T
edge(b, b) = T    edge(d, d) = T
edge(c, a) = T

Fig. 2. A directed graph and the corresponding initial database.



programming. In §3 we describe products of experts, a concept from machine learning
that elucidates the kinds of probabilistic models amenable to our framework. In
§4 we describe the PRODUCT transformation. In §5 we show how several well-known
algorithms can be derived using the PRODUCT transformation applied to
simpler algorithms. §6 presents some variations on the PRODUCT transformation. In
§7 we show how to use the PRODUCT transformation and a specially designed semiring
to calculate important information-theoretic quantities related to probability
distributions over proofs.



2 Weighted Logic Programming 

To motivate weighted logic programming, we begin with a logic program for single- 
source connectivity on a directed graph, shown in Figure 1. In the usual bottom-
up interpretation of this program, an initial database (i.e., set of axioms) would
describe the edge relation and one (or more) starting vertices as axioms of the form
initial(a) for some a. Repeated forward inference can then be applied on the
rules in Figure 1 to find the least database closed under those rules. However, in
traditional logic programming this program can only be understood as a program 
calculating connectivity over a graph. 

Weighted logic programming generalizes traditional logic programming. In tra- 
ditional logic programming, a proof is a tree of valid (deductive) inferences from 
axioms, and a valid atomic proposition is one that has at least one proof. In weighted 
logic programming we generalize this notion: axioms, proofs, and atomic proposi- 
tions are said to "have values" rather than just "be valid." Traditional logic pro- 
grams can be understood as weighted logic programs with Boolean values: axioms 
all have the value "true," as do all valid propositions. The single-source connectivity 
program would describe the graph in Figure 2 by assigning T as the value of all the
existing edges and the proposition initial(a). 

2.1 Non-Boolean Programs

With weighted logic programming, the axioms and propositions can be understood 
as having non-Boolean values. In Figure 3, each axiom of the form edge(X, Y) is
given a value corresponding to the cost associated with that edge in the graph, and
the axiom initial(a) is given the value 0. If we take the value or "score" of a
proof to be the sum of the values of the axioms at its leaves and take the value of
a proposition to be the minimum score over all possible proofs, then the program
from Figure 1 gives a declarative specification of the single-source shortest path
problem. Multiple uses of an axiom in a proof are meaningful: if a proof includes 
the edge(d, d) axiom once, it corresponds to a single traversal of the loop from d 
to d and adds a cost of 2, and if a proof includes the axiom twice, it corresponds 
to two distinct traversals and adds a cost of 4. 

We replace the connectives :- (disjunction) and , (conjunction) with min= and +,
respectively, and interpret the weighted logic program over the non-negative numbers.
With a specific execution model, the result is Dijkstra's single-source shortest-path
algorithm.




initial(a) = 0     edge(c, d) = 15
edge(a, c) = 4     edge(d, b) = 6
edge(a, d) = 20    edge(d, c) = 16
edge(b, b) = 8     edge(d, d) = 2
edge(c, a) = 9

Fig. 3. A cost graph and the corresponding initial database.
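
As a rough illustration of the min=/+ reading, the following Python sketch evaluates rules (1) and (2) over the axioms of Figure 3 by naive relaxation until nothing changes. It is only one possible execution strategy (it is not Dijkstra's algorithm and not the Dyna execution model), and the names `INITIAL`, `EDGE_COST`, and `shortest_paths` are ours, introduced for illustration.

```python
import math

# Axiom values from Figure 3; initial(a) has value 0.
INITIAL = {"a": 0.0}
EDGE_COST = {
    ("a", "c"): 4, ("a", "d"): 20, ("b", "b"): 8, ("c", "a"): 9,
    ("c", "d"): 15, ("d", "b"): 6, ("d", "c"): 16, ("d", "d"): 2,
}

def shortest_paths(initial, edge_cost):
    """Least fixpoint of rules (1)-(2) with :- read as min= and , read as +."""
    reachable = dict(initial)                     # rule (1)
    changed = True
    while changed:
        changed = False
        for (p, q), cost in edge_cost.items():    # rule (2)
            candidate = reachable.get(p, math.inf) + cost
            if candidate < reachable.get(q, math.inf):
                reachable[q] = candidate
                changed = True
    return reachable

costs = shortest_paths(INITIAL, EDGE_COST)
print(costs)
```

Under these assumptions the cheapest proof of reachable(b) uses edge(a, c), edge(c, d), and edge(d, b), which is cheaper than going through edge(a, d) directly.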



In addition to the cost-minimization interpretation in Figure 3, we can interpret
weights on edges as probabilities and restate the problem in terms of probability
maximization. In Figure 4, the outgoing edges from each vertex sum to at most 1.
If we assign the missing 0.1 probability from vertex b to a "stopping" event (either
implicitly or explicitly by modifying the axioms), then each vertex's outgoing edges
sum to exactly one and the graph can be seen as a Markov model or probabilistic
finite-state network over which random walks are well-defined. If we replace
the connectives :- (disjunction) and , (conjunction) with max= and ×, then the
value of reachable(X) for any X is the probability of the most likely path from
a to X. For instance, reachable(a) ends up with the value 1, and reachable(b)
ends up with value 0.16, corresponding to the path a → d → b, whose weight is
(value of initial(a) × value of edge(a, d) × value of edge(d, b)).

If we keep the initial database from Figure 4 but change our operators from max=
and × to += and ×, the result is a program for summing over the probabilities
of all distinct paths that start in a and lead to X, for each vertex X. This quantity
is known as the "path sum" (Tarjan 1981). The path sum for reachable(b), for

initial(a) = 1      edge(c, d) = 0.4
edge(a, c) = 0.2    edge(d, b) = 0.2
edge(a, d) = 0.8    edge(d, c) = 0.3
edge(b, b) = 0.9    edge(d, d) = 0.5
edge(c, a) = 0.6

Fig. 4. A probabilistic graph and the corresponding initial database. With stopping
probabilities made explicit, this would encode a Markov model.



reachable(Q) ⊕= initial(Q). (3)
reachable(Q) ⊕= reachable(P) ⊗ edge(P, Q). (4)

Fig. 5. The logic program from Figure 1, rewritten to emphasize that it is generalized
to an arbitrary semiring.



instance, is 10; this is not a probability, but rather an infinite sum of probabilities
of many paths, some of which are prefixes of each other.³
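
The value 10 can be checked numerically. The sketch below approximates the +=/× reading of rules (1) and (2) on the axioms of Figure 4 by repeated sweeps, which converge here because the infinite sums are geometric. The function name `path_sums` and the fixed number of sweeps are our own choices for illustration, not part of any weighted logic programming system.

```python
# Axiom values from Figure 4.
INITIAL = {"a": 1.0}
EDGE_PROB = {
    ("a", "c"): 0.2, ("a", "d"): 0.8, ("b", "b"): 0.9, ("c", "a"): 0.6,
    ("c", "d"): 0.4, ("d", "b"): 0.2, ("d", "c"): 0.3, ("d", "d"): 0.5,
}

def path_sums(initial, edge_prob, sweeps=2000):
    """Approximate the += / x reading of rules (1)-(2) by repeated sweeps."""
    vertices = {v for edge in edge_prob for v in edge}
    value = {v: 0.0 for v in vertices}
    for _ in range(sweeps):
        # Each sweep recomputes every reachable(q) from the previous values.
        value = {
            q: initial.get(q, 0.0)
            + sum(value[p] * w for (p, target), w in edge_prob.items() if target == q)
            for q in vertices
        }
    return value

sums = path_sums(INITIAL, EDGE_PROB)
print(round(sums["b"], 6))
```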

These three related weighted logic programs are useful generalizations of the
reachability logic program in Figure 1. Figure 5 gives a generic representation of
all four algorithms in the Dyna language (Eisner et al. 2005). The key difference
among them is the semiring in which we interpret the weights. An algebraic semiring
consists of five elements (K, ⊕, ⊗, 0, 1), where K is a domain closed under ⊕ and ⊗,
⊕ is a binary, associative, commutative operator, ⊗ is a binary, associative operator
that distributes over ⊕, 0 ∈ K is the ⊕-identity, and 1 ∈ K is the ⊗-identity.
We require, following Goodman (1999), that the semirings we use be complete.
Complete semirings are semirings with the additional property that they are closed 
under finite products and infinite sums — in our running example, this corresponds 
to the idea that there may be infinitely many paths through a graph, all with finite 
length. Complete semirings also have the property that infinite sums behave like 
finite ones — they are associative and commutative, and the multiplicative operator 
distributes over them. 

In our running example, reachability uses the Boolean semiring ({T, F}, ∨, ∧, F, T),
single-source shortest path uses (ℝ≥0 ∪ {∞}, min, +, ∞, 0), the most-probable-path



³ Clearly "10" is not a meaningful probability, but that is a result of the loop from b to b with
probability 0.9; in fact, one informal way of looking at the result is simply to observe that
10 = 1 + 0.9 + (0.9)^2 + (0.9)^3 + ..., corresponding to proofs of reachable(b) that include
edge(b, b) zero, one, two, three, ... times. If we added an axiom edge(b, final) with weight 0.1
representing the 10% probability of stopping at any step in state b, then the path sum for
reachable(final) would be 10 × 0.1 = 1, which is a reasonable probability that corresponds
to the fact that a graph traversal can be arbitrarily long but has a 100% chance of eventually
reaching b and then stopping.

variant uses ([0, 1], max, ×, 0, 1), and the probabilistic path-sum variant uses the
so-called "real" semiring (ℝ≥0 ∪ {∞}, +, ×, 0, 1).
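
To make the role of the semiring concrete, here is a minimal Python sketch in which the program of Figure 5 is written once and the semiring is passed in as a parameter. The `Semiring` class, the three instances, and the `reachable` function are hypothetical names of ours. The sketch covers only the three idempotent semirings, for which a few sweeps reach the fixpoint on these small graphs; the "real" semiring involves infinite sums and would need a convergence test instead.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]    # the ⊕ operator
    times: Callable[[Any, Any], Any]   # the ⊗ operator
    zero: Any                          # identity for ⊕
    one: Any                           # identity for ⊗

BOOLEAN = Semiring(lambda x, y: x or y, lambda x, y: x and y, False, True)
MIN_PLUS = Semiring(min, lambda x, y: x + y, math.inf, 0.0)
MAX_TIMES = Semiring(max, lambda x, y: x * y, 0.0, 1.0)

def reachable(semiring, initial, edge, sweeps=10):
    """Evaluate rules (3)-(4) of Figure 5 by a fixed number of sweeps."""
    vertices = {v for pair in edge for v in pair} | set(initial)
    value = {v: semiring.zero for v in vertices}
    for _ in range(sweeps):
        updated = {}
        for q in vertices:
            total = initial.get(q, semiring.zero)                 # rule (3)
            for (p, target), weight in edge.items():              # rule (4)
                if target == q:
                    total = semiring.plus(total, semiring.times(value[p], weight))
            updated[q] = total
        value = updated
    return value

# Edge weights from Figure 3 (costs) and Figure 4 (probabilities).
EDGE_COST = {("a", "c"): 4, ("a", "d"): 20, ("b", "b"): 8, ("c", "a"): 9,
             ("c", "d"): 15, ("d", "b"): 6, ("d", "c"): 16, ("d", "d"): 2}
EDGE_PROB = {("a", "c"): 0.2, ("a", "d"): 0.8, ("b", "b"): 0.9, ("c", "a"): 0.6,
             ("c", "d"): 0.4, ("d", "b"): 0.2, ("d", "c"): 0.3, ("d", "d"): 0.5}

connected = reachable(BOOLEAN, {"a": True}, {e: True for e in EDGE_PROB})
costs = reachable(MIN_PLUS, {"a": 0.0}, EDGE_COST)
viterbi = reachable(MAX_TIMES, {"a": 1.0}, EDGE_PROB)
```

Only the `Semiring` argument and the axiom values change between the three calls; the rules themselves are shared.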

Weighted logic programming developed primarily within the computational linguistics
community. Building upon the observations of Shieber, Schabes, and Pereira
(1995) and Sikkel (1997) that many parsing algorithms for nondeterministic grammars
could be represented as deductive logic programs, Goodman (1999) showed
that the structure of the parsing algorithms was amenable to interpretation on
a number of semirings. McAllester (1999) additionally showed that this representation
facilitates reasoning about asymptotic complexity. Other developments include
a connection between weighted logic programs and hypergraphs (Klein and
Manning 2004), optimal A* search for maximizing programs (Felzenszwalb and
McAllester 2007), semiring-general agenda-based implementations (Eisner et al.
2005), improved k-best algorithms (Huang and Chiang 2005), and program transformations
to improve efficiency (Eisner and Blatz 2007).



2.2 Formal Definition 



Eisner and Blatz (2007) describe the semantics of weighted logic programs in detail; 
we summarize their discussion in this section and point the reader to that paper 
for further detail. A weighted logic program is a set of Horn equations describing 
a set of declarative, usually recursive equations over an abstract semiring. Horn 
equations, which we will refer to by the shorter and more traditional term rules, 
take the form 

consequent(U) ⊕= antecedent_1(W_1) ⊗ ··· ⊗ antecedent_n(W_n).

Here U and the W_i are sequences of terms which include free variables. If the
variables in U are a subset of the variables in W_1 ... W_n for every rule, then the
program is range restricted or fully grounded.

We can also give rules side conditions. Side conditions are additional constraints 
that are added to a rule to remove certain proofs from consideration. For example, 
side conditions could allow us to modify rule 4 in Figure 5 to disallow self-loops
and only allow traversal of an edge when there was another edge in the opposite
direction:

reachable(Q) ⊕= reachable(P) ⊗ edge(P, Q) if edge(Q, P) ∧ Q ≠ P. (5)

Side conditions do not change the value of any individual proof; they only filter out
any proof that does not satisfy the side conditions. In this paper, we use mostly side
conditions that enforce equality between variables. For a more thorough treatment
of side conditions see Goodman (1999) or Eisner and Blatz (2007).
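
The filtering effect of a side condition can be sketched in Python as follows. We assume the cost axioms of Figure 3 and the min=/+ reading; the `side_condition` parameter and the helper `rule5_condition` are hypothetical names of ours. The condition only decides whether a rule instance may be used and never alters a score.

```python
import math

# Axiom values from Figure 3.
INITIAL = {"a": 0.0}
EDGE_COST = {
    ("a", "c"): 4, ("a", "d"): 20, ("b", "b"): 8, ("c", "a"): 9,
    ("c", "d"): 15, ("d", "b"): 6, ("d", "c"): 16, ("d", "d"): 2,
}

def rule5_condition(p, q):
    """Side condition of rule (5): edge(Q, P) holds and Q differs from P."""
    return (q, p) in EDGE_COST and q != p

def shortest_paths(initial, edge_cost, side_condition=lambda p, q: True):
    reachable = dict(initial)
    changed = True
    while changed:
        changed = False
        for (p, q), cost in edge_cost.items():
            if not side_condition(p, q):
                continue  # this rule instance is filtered out; scores are untouched
            candidate = reachable.get(p, math.inf) + cost
            if candidate < reachable.get(q, math.inf):
                reachable[q] = candidate
                changed = True
    return reachable

unconstrained = shortest_paths(INITIAL, EDGE_COST)
constrained = shortest_paths(INITIAL, EDGE_COST, rule5_condition)
```

With the side condition in place, vertex b receives no value at all, because edge(d, b) has no edge in the opposite direction.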



A weighted logic program is specified on an arbitrary semiring, and can be interpreted
in any semiring (K, ⊕, ⊗, 0, 1) as previously described. The meaning of
a weighted logic program is determined by the rules together with a set of fully 
grounded axioms (or facts in the Prolog setting). Each axiom is assigned a value 
from the set K that is interpreted as a weight or score. 

A common idiom in weighted logic programming is to specify the query as a 
distinguished predicate goal that takes no arguments. A computationally uninteresting
(because there are no intermediate computation steps) but otherwise legitimate
way to present a weighted logic program is as a single rule of the form

goal ⊕= axiom_1(W_1) ⊗ ··· ⊗ axiom_n(W_n).

In this degenerate case, each distinct way of satisfying the premises using axioms in 
the database would correspond to a distinct proof of goal. The score of each proof 
would be given by the semiring-product of the scores of the axioms, and the value 
of goal would be determined by the semiring-sum of the scores of all the proofs. 

In the general case, the value of the proposition/theorem goal is a semiring-sum 
over all of its proofs, starting from the axioms, where the value of any single proof 
is the semiring-product of the axioms involved. This is effectively encoded using 
the inference rules as a sum of products of sums of products of ... sums of products,
exploiting distributivity and shared substructure for efficiency. This inherent
notion of shared substructure means that weighted logic programming can give
straightforward declarative specifications for problems that are typically solved by
dynamic programming. The Dyna programming language implements a particular
dynamic programming strategy for implementing these declarative specifications
(Eisner et al. 2005), though the agenda algorithm that it implements may potentially
have significantly different behavior, in terms of time and space complexity,
than other dynamic programming algorithms that meet the same specification.



In many practical applications, as in our reachability example in §2.1, values are
interpreted as probabilities to be maximized or summed or costs to be minimized.



3 Weighted Logic Programs and Probabilistic Reasoning 

In this section, we return our focus to the probabilistic interpretation of weighted
logic programs that we first described in the introduction. In §3.1 we will describe
in more detail how the results of weighted logic programs are interpreted
as probabilities; readers with a background in statistics and machine learning can
probably skip or skim that section. In §3.2 we will introduce the notion of a product
of experts that motivates the PRODUCT transformation.

Our running example for this section is a probabilistic finite-state automaton 



initial(a) = 1
final(c) = 1
arc(a, b, 0) = 0.5
arc(a, b, 1) = 0.3
arc(a, d, 0) = 0.2
arc(b, c, 0) = 0.2
arc(b, c, 1) = 0.8
arc(d, c, 1) = 1.0

Fig. 6. A probabilistic FSA and the corresponding initial database.






Shay B. Cohen, Robert J. Simmons, and Noah A. Smith 



goal ⊕= path(Q) ⊗ final(Q). (6)
path(Q) ⊕= initial(Q). (7)
path(Q) ⊕= path(P) ⊗ arc(P, Q, A). (8)

Fig. 7. The weighted logic program for weighted FSA reachability.



goal ⊕= path(Q, I) ⊗ final(Q) ⊗ length(I). (9)
path(Q, 0) ⊕= initial(Q). (10)
path(Q, I) ⊕= path(P, I − 1) ⊗ arc(P, Q, A) ⊗ string(I, A). (11)

Fig. 8. The weighted logic program for weighted FSA recognition.



over the alphabet {0, 1}, shown in Figure 6. The most probable path through the
graph is the one that recognizes the string "01" by going through states a, b, and c,
and the probability of this path is 0.4. Other than the labels on the edges, this
is the same setup used in the graph-reachability example from Figure 4. The edge
predicate from the previous section is now called arc and has been augmented to 
carry a third argument representing an output character. 

3.1 The Probabilistic Interpretation of Weighted Logic Programming 

Recall from the introduction that, in the context of weighted logic programming, 
we have random variables I, A, and P, where

• I corresponds to a set of conditional "input" axioms known to be true;

• A corresponds to a set of axioms known to be true; and 

• P corresponds to a deductive proof of the goal theorem using the axioms. 

In this case, I corresponds to one of the various possible sentences recognized by
the FSA (i.e., 00, 01, 10, and 11). A corresponds to a particular directed graph
with weighted edges, encoded by a set of axioms. P corresponds to an individual
proof/path through the graph. In Figure 7, which is the straightforward adaptation
of the reachability program in Figure 5 to labeled edges, the value of goal in the
most-probable-path semiring is max_proof p(P = proof, I = sentence | A = graph),
the value of the most probable path emitting any possible sentence I.

In order to talk about the input sentences I, we first add a set of axioms that describe
I. If we are interested in the sentence "01" we would add axioms string(1, 0),
string(2, 1), and length(2), whereas if we were interested in the sentence "hey" we
would add axioms string(1, h), string(2, e), string(3, y), and length(3). These
axioms are all given the value 1 (the multiplicative unit of the semiring), and so they
could equivalently be treated as side conditions. With these new axioms, we modify
Figure 7 to obtain Figure 8, a weighted logic program that limits the proofs/paths
to the ones which represent recognition of the input string I.⁴

Now, Figure 8 interpreted over the most-probable-path semiring does allow us
to find the proof that, given the edge weights and a specific sentence, maximizes
p(P = proof | I = sentence, A = graph). It does not, however, give us p(P = proof |
I = sentence, A = graph), but rather p(P = proof, I = sentence | A = graph), the
joint probability of a path and a sentence given the weights on the edges.

Concretely, in our running example there are five possible proofs of goal in
Figure 7 whose probabilities sum to 1, but there are only two proofs that also
recognize the string "01," which are a0b1c with weight 0.4 and a0d1c with weight
0.2; the route through b is twice as likely. The value of goal in Figure 8 interpreted
in the most-probable-path semiring would be 0.4 (the joint probability of obtaining
the proof a0b1c and of recognizing the string "01"), not 2/3 ≈ 0.67 (the probability of the
proof a0b1c given the sentence "01"). In other words, we have: p(P = a0b1c, I =
01 | A = Fig. 6) = 0.4, p(P = a0d1c, I = 01 | A = Fig. 6) = 0.2, and p(P = a0b1c | I =
01, A = Fig. 6) = 2/3.

The solution for correctly discovering the conditional probability lies in the fact 
that the joint and conditional probabilities are related in the following way: 

\[
p(P \mid A, I) = \frac{p(P, I \mid A)}{p(I \mid A)}
\]



This, combined with the knowledge that the marginal probability p(I | A) is the result
of evaluating Figure 8 over the path-sum semiring (i.e., (ℝ≥0 ∪ {∞}, +, ×, 0, 1)),
allows us to correctly calculate not only the most probable proof P of a given
sentence but also the probability of that proof given the sentence. The marginal
probability in our running example is 0.6, and 0.4/0.6 = 2/3 ≈ 0.67, which is the desired
result.
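
The arithmetic above can be reproduced with a short Python sketch of the program in Figure 8, assuming the axiom values of Figure 6. The function `goal_value` is a hypothetical helper of ours: it fixes ⊗ to multiplication and takes the ⊕ operator as an argument, so the same code yields the joint probability (with max) and the marginal (with addition).

```python
import operator

# Axiom values from Figure 6.
INITIAL = {"a": 1.0}
FINAL = {"c": 1.0}
ARC = {
    ("a", "b", "0"): 0.5, ("a", "b", "1"): 0.3, ("a", "d", "0"): 0.2,
    ("b", "c", "0"): 0.2, ("b", "c", "1"): 0.8, ("d", "c", "1"): 1.0,
}

def goal_value(string, plus):
    """Value of goal in Figure 8 for one input string; `plus` is the ⊕ operator."""
    path = dict(INITIAL)                              # rule (10): path(Q, 0)
    for symbol in string:                             # rule (11): path(Q, I)
        step = {}
        for (p, q, label), weight in ARC.items():
            if label == symbol and p in path:
                score = path[p] * weight
                step[q] = plus(step[q], score) if q in step else score
        path = step
    total = 0.0                                       # 0 is the ⊕-identity for both uses
    for q, weight in FINAL.items():                   # rule (9)
        if q in path:
            total = plus(total, path[q] * weight)
    return total

joint = goal_value("01", max)                 # most-probable-path semiring
marginal = goal_value("01", operator.add)     # path-sum semiring
conditional = joint / marginal
print(joint, marginal, conditional)
```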

To restate this in a way that is more notationally consistent with other work in 
machine learning, we first take the weighted axioms A as implicit. Then, instead of 
proofs P we talk about values y for a random variable Y drawn out of a domain
𝒴 (the space of possible structures, which in our setting corresponds to the space
of possible proofs), and instead of inputs I we talk about values x for a random
variable X drawn out of a domain 𝒳 (the space of all possible inputs).

Then, to predict the most likely observed value for y, denoted ŷ, we have the
following formula:

\[
\hat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}} p(Y = y \mid X = x) = \operatorname*{argmax}_{y \in \mathcal{Y}} \frac{p(Y = y, X = x)}{p(X = x)} \tag{12}
\]

Because p(X = x) does not depend on y, if we only want to know ŷ it suffices to
find the y that maximizes p(Y = y, X = x) (which was written as p(P = proof, I =



⁴ Rule 11 in this figure uses "I − 1" in a premise: we assume that our formalism includes natural
numbers that support increment/decrement operations, and our simple uses can be understood
as syntactic shorthand either for structured terms (z, s(z), etc.) or for the use of primitive side
conditions such as inc(I, I').

sentence | A = axioms) above). One way to do this is to execute a weighted logic
program in the most-probable-path semiring.



3.2 Products of Experts 

Of recent interest are probability models p that take a factored form, for example: 



\[
p(Y = y \mid X = x) \propto p_1(Y = y \mid X = x) \times \cdots \times p_n(Y = y \mid X = x) \tag{13}
\]



where ∝ signifies "proportional to" and suppresses the means by which the probability
distribution is renormalized to sum to one. This kind of model is called a
product of experts (Hinton 2002). Intuitively, the probability of an event under p
can only be relatively large if "all the experts concur," i.e., if the probability is large
under each of the p_i. Any single expert can make an event arbitrarily unlikely (even
impossible) by giving it very low probability, and the solution to Equation 12 for
a product of experts model will be the y ∈ 𝒴 (here, a proof) least objectionable to
all experts.
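
A toy numerical illustration of Equation 13 may help. The two expert distributions below are made-up numbers over three hypothetical outcomes, chosen only to show the veto effect; `product_of_experts` is our own helper name.

```python
def product_of_experts(*experts):
    """Multiply expert probabilities pointwise and renormalize (Equation 13)."""
    unnormalized = {}
    for y in experts[0]:
        score = 1.0
        for expert in experts:
            score *= expert[y]
        unnormalized[y] = score
    z = sum(unnormalized.values())   # the normalizer hidden by "proportional to"
    return {y: s / z for y, s in unnormalized.items()}

# Hypothetical experts over the same three outcomes.
expert_1 = {"y1": 0.6, "y2": 0.4, "y3": 0.0}
expert_2 = {"y1": 0.1, "y2": 0.5, "y3": 0.4}

combined = product_of_experts(expert_1, expert_2)
best = max(combined, key=combined.get)
print(best, combined)
```

Here y1 is the first expert's favorite but is nearly ruled out by the second, y3 is vetoed outright by the first, and y2, which neither expert objects to strongly, comes out on top.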

The attraction of such probability distributions is that they modularize complex
systems (Klein and Manning 2003; Liang et al. 2008). They can also offer computational
advantages when solving Equation 12 (Chiang 2007). Further, the expert
factors can often be trained (i.e., estimated from data) separately, speeding up expensive
but powerful machine learning methods (Smith and Smith 2004; Sutton
and McCallum 2005; Smith et al. 2005; Cohen and Smith 2007).



To the best of our knowledge, there has been no attempt to formalize the following 
intuitive idea about products of experts: algorithms for reasoning about mutually 
constrained product proof values should resemble the individual algorithms for each 
of the two separate "factor" proofs' values. Our formalization is intended to aid in 
algorithm development as new kinds of complex random variables are coupled, with 
a key practical advantage: the expert factors are known because they fundamentally 
underlie the main algorithm. Indeed, we call our algorithms "products" because 
they are derived from "factors," analogous to the product of expert probability 
models that are derived from factor expert probability models. 

To relate this observation to the running example from this section, imagine we
created two copies of Figure 8 which operated over the same sentence (as described
by string and length predicates) but which had different predicates and axioms
goal_1, path_1, final_1, initial_1, and arc_1 (and likewise goal_2, path_2, etc.). Consider
a combined goal predicate goal_{1•2} defined by the rule

goal_{1•2} ⊕= goal_1 ⊗ goal_2. (14)



Now we have two experts (goal_1 and goal_2), and we literally take the (semiring)
product of them, but this is still not quite the "product of experts," because the
proofs of the goals are allowed to be independent. In other words, what we have is
the following:

\[
p(Y_1 = y_1, Y_2 = y_2 \mid X = x) \propto p_1(Y_1 = y_1 \mid X = x) \times p_2(Y_2 = y_2 \mid X = x)
\]



The PRODUCT transformation is a meaning-preserving transformation on weighted 
logic programs that exposes the joint structure in such a way that, depending on
our domain-specific understanding of what it means for the two proofs y_1 and y_2 to
match, we can add constraints that result in a weighted logic program that
forces the structures to match, as required by the specification in Equation 13.



4 Products of Weighted Logic Programs 

In this section, we will motivate products of weighted logic programs in the con- 
text of the running example of generalized graph reachability. We will then define 
the PRODUCT transformation precisely and describe the process of specifying new 
algorithms as constrained versions of product programs. 

The PRODUCT transformation can be seen as an instance of the tupling program
transformation combined with an unfold/fold transformation (Pettorossi and
Proietti 1994; Pettorossi 1999) that preserves the meaning of programs. However, we
are interested in this transformation not for reasons of efficiency, but because it has 
the effect of exposing the shared structure of the two individual programs in such 
a way that, by the manual addition of constraints, we can force the two original 
programs to optimize over the same structures, thereby implementing optimization 
over the product of experts as described in the previous section. The addition of 
constraints requires an understanding of the problem at hand, as we show in §5 by
presenting a number of examples. 

4.1 The Product of Graph Reachability Experts

Figure 9 defines two experts, copies of the graph-reachability program from Figure 5.
We are interested in a new predicate reachable_{1•2}(Q_1, Q_2), which for any particular
Q_1 and Q_2 should be equal to the product of reachable_1(Q_1) and reachable_2(Q_2).
Just as we did in our thought experiment with goal_{1•2} in the previous section, we
could define the predicate by adding the following rule to the program in Figure 9:

reachable_{1•2}(Q_1, Q_2) ⊕= reachable_1(Q_1) ⊗ reachable_2(Q_2).

This program is a bit simplistic, however; it merely describes calculating the experts 
independently and then combining them at the end. 

The predicate reachable_{1•2} can alternatively be calculated by adding the following
four rules to Figure 9:

reachable_{1•2}(Q_1, Q_2) ⊕= initial_1(Q_1) ⊗ initial_2(Q_2).
reachable_{1•2}(Q_1, Q_2) ⊕= initial_1(Q_1) ⊗ reachable_2(P_2) ⊗ edge_2(P_2, Q_2).
reachable_{1•2}(Q_1, Q_2) ⊕= reachable_1(P_1) ⊗ edge_1(P_1, Q_1) ⊗ initial_2(Q_2).
reachable_{1•2}(Q_1, Q_2) ⊕= reachable_1(P_1) ⊗ edge_1(P_1, Q_1) ⊗
                             reachable_2(P_2) ⊗ edge_2(P_2, Q_2).



This step is described as an unfold by Pettorossi (1999). This unfold can then
be followed by a fold: because reachable_{1•2}(Q_1, Q_2) was defined above to be the
product of reachable_1(Q_1) and reachable_2(Q_2), we can replace each instance of
the two premises reachable_1(Q_1) and reachable_2(Q_2) with the single premise
reachable_{1•2}(Q_1, Q_2).

reachable_1(Q_1) ⊕= initial_1(Q_1). (15)
reachable_1(Q_1) ⊕= reachable_1(P_1) ⊗ edge_1(P_1, Q_1). (16)

reachable_2(Q_2) ⊕= initial_2(Q_2). (17)
reachable_2(Q_2) ⊕= reachable_2(P_2) ⊗ edge_2(P_2, Q_2). (18)

Fig. 9. Two identical experts for generalized graph reachability, duplicates of the
program in Figure 5.

reachable_{1•2}(Q_1, Q_2) ⊕= initial_1(Q_1) ⊗ initial_2(Q_2). (19)
reachable_{1•2}(Q_1, Q_2) ⊕= reachable_2(P_2) ⊗ edge_2(P_2, Q_2) ⊗ initial_1(Q_1). (20)
reachable_{1•2}(Q_1, Q_2) ⊕= reachable_1(P_1) ⊗ edge_1(P_1, Q_1) ⊗ initial_2(Q_2). (21)
reachable_{1•2}(Q_1, Q_2) ⊕= reachable_{1•2}(P_1, P_2) ⊗ edge_1(P_1, Q_1) ⊗ edge_2(P_2, Q_2). (22)

Fig. 10. Four rules that, in addition to the rules in Figure 9, give the product of
the two experts defined by the reachable_1 and reachable_2 predicates.



The new rules that result from this replacement can be seen in Figure 10.
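
As a sanity check on the rules of Figure 10, the Python sketch below evaluates them in the most-probable-path semiring and compares the result with the product of the two experts computed separately. For illustration only, we assume that both experts use the edge probabilities of Figure 4; the function names `expert` and `product_program` and the fixed number of sweeps are ours.

```python
import math
from itertools import product

# Both experts are assumed to share the axioms of Figure 4.
INITIAL = {"a": 1.0}
EDGE = {
    ("a", "c"): 0.2, ("a", "d"): 0.8, ("b", "b"): 0.9, ("c", "a"): 0.6,
    ("c", "d"): 0.4, ("d", "b"): 0.2, ("d", "c"): 0.3, ("d", "d"): 0.5,
}
VERTICES = sorted({v for e in EDGE for v in e})

def expert(sweeps=20):
    """One expert under (max, x): rules (15)-(16), equivalently (17)-(18)."""
    r = {v: 0.0 for v in VERTICES}
    for _ in range(sweeps):
        r = {
            q: max([INITIAL.get(q, 0.0)]
                   + [r[p] * w for (p, t), w in EDGE.items() if t == q])
            for q in VERTICES
        }
    return r

def product_program(r1, r2, sweeps=20):
    """The product predicate under (max, x): rules (19)-(22) of Figure 10."""
    r12 = {pair: 0.0 for pair in product(VERTICES, repeat=2)}
    for _ in range(sweeps):
        updated = {}
        for q1, q2 in r12:
            candidates = [INITIAL.get(q1, 0.0) * INITIAL.get(q2, 0.0)]        # (19)
            for (p2, t2), w2 in EDGE.items():
                if t2 == q2:
                    candidates.append(r2[p2] * w2 * INITIAL.get(q1, 0.0))      # (20)
            for (p1, t1), w1 in EDGE.items():
                if t1 != q1:
                    continue
                candidates.append(r1[p1] * w1 * INITIAL.get(q2, 0.0))          # (21)
                for (p2, t2), w2 in EDGE.items():
                    if t2 == q2:
                        candidates.append(r12[(p1, p2)] * w1 * w2)             # (22)
            updated[(q1, q2)] = max(candidates)
        r12 = updated
    return r12

r1 = expert()
r2 = expert()
r12 = product_program(r1, r2)
agree = all(
    math.isclose(r12[(q1, q2)], r1[q1] * r2[q2], rel_tol=1e-9, abs_tol=1e-12)
    for q1, q2 in r12
)
print(agree)
```

Rule (22) is the only recursive rule for the product predicate; it advances both experts by one edge at a time, which is where constraints between the two structures can later be attached.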



4.2 The PRODUCT Transformation 

The PRODUCT program transformation is shown in Figure 11. For each desired
product of experts, where one expert, the predicate p, is defined by n rules and the
other expert q by m rules, the transformation defines the product of experts for
p•q with n × m new rules, the cross product of inference rules from the first and
second experts. The value of a coupled proposition p•q in 𝒫′ will be equal to the
semiring product of p's value and q's value in 𝒫 (or, equivalently, in 𝒫′).

Note that lines 6-8 are nondeterministic under certain circumstances, because if 
the antecedent of the combined program is a(X) (g) a(Y) €5 b(Z) and the algorithm is 
computing the product of a and b, then the resulting antecedent could be cither 
a»b(X, Z) (g) a(Y) or a»b(Y, Z) (X) a(X). This nondetcrminism usually does not arise, 
and when it does, as in §5.2[ there is usually an obvious preference. 

The PRODUCT transformation is essentially meaning preserving: if the program
𝒫′ is the result of the PRODUCT transformation on 𝒫, then the following is true:

• Any ground instance p(X) that is given a value in 𝒫 is given the same value in
𝒫′. This is immediately apparent because the program 𝒫′ is stratified: none
of the new rules are ever used to compute values of the form p(X), so their
value is identical to their value in 𝒫.

• Any ground instance p•q(X, Y) in 𝒫′ has the same value as p(X) ⊗ q(Y).
This is the result of the following theorem (Theorem 1).

Input: A logic program 𝒫 and a set S of pairs of predicates (p, q).
Output: A program 𝒫′ that extends 𝒫, additionally computing the product predicate
p•q for every pair (p, q) ∈ S in the input.

𝒫′ ← 𝒫
for all pairs (p, q) in S do
    for all rules in 𝒫 of the form p(W) ⊕= A₁ ⊗ ⋯ ⊗ Aₙ do
        for all rules in 𝒫 of the form q(X) ⊕= B₁ ⊗ ⋯ ⊗ Bₘ do
            let r ← [p•q(W, X) ⊕= A₁ ⊗ ⋯ ⊗ Aₙ ⊗ B₁ ⊗ ⋯ ⊗ Bₘ]
            for all pairs (s(Y), t(Z)) of antecedents in r such that (s, t) ∈ S do
                remove the antecedents s(Y) and t(Z) from r
                insert the antecedent s•t(Y, Z) into r
            end for
            add r to 𝒫′
        end for
    end for
end for
return 𝒫′

Fig. 11. Algorithmic specification of the PRODUCT transformation.
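The algorithm in Figure 11 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the rule encoding (a rule is a `(head, body)` pair, an atom is a `(predicate, arguments)` pair), the function names, and the greedy left-to-right resolution of the nondeterministic pairing step are all assumptions made here. The sketch also assumes the two experts already use disjoint variable names, and it names the product predicate `reachable1•reachable2` rather than reachable₁•₂.

```python
from itertools import product as cartesian

# Hypothetical encoding: a rule is (head, body); an atom is (predicate, args).


def paired_name(p, q):
    """Name of the product predicate for the pair (p, q)."""
    return f"{p}•{q}"


def merge_antecedents(body, pairs):
    """Greedily replace antecedents s(Y), t(Z) by s•t(Y, Z) when (s, t) is paired."""
    remaining = list(body)
    merged = []
    while True:
        match = next(
            ((i, j)
             for i, (s, _) in enumerate(remaining)
             for j, (t, _) in enumerate(remaining)
             if i != j and (s, t) in pairs),
            None,
        )
        if match is None:
            return merged + remaining
        i, j = match
        (s, ys), (t, zs) = remaining[i], remaining[j]
        merged.append((paired_name(s, t), ys + zs))
        remaining = [a for k, a in enumerate(remaining) if k not in (i, j)]


def product_transform(program, pairs):
    """Return the input rules plus one new rule per pair of (p-rule, q-rule)."""
    new_program = list(program)  # the output starts as a copy of the input
    for p, q in sorted(pairs):
        p_rules = [r for r in program if r[0][0] == p]
        q_rules = [r for r in program if r[0][0] == q]
        for (head_p, body_p), (head_q, body_q) in cartesian(p_rules, q_rules):
            head = (paired_name(p, q), head_p[1] + head_q[1])
            new_program.append((head, merge_antecedents(body_p + body_q, pairs)))
    return new_program


# The two reachability experts of Figure 9 (rules 15-18).
PROGRAM = [
    (("reachable1", ("Q1",)), [("initial1", ("Q1",))]),
    (("reachable1", ("Q1",)), [("reachable1", ("P1",)), ("edge1", ("P1", "Q1"))]),
    (("reachable2", ("Q2",)), [("initial2", ("Q2",))]),
    (("reachable2", ("Q2",)), [("reachable2", ("P2",)), ("edge2", ("P2", "Q2"))]),
]
PAIRS = {("reachable1", "reachable2")}
PRODUCT_PROGRAM = product_transform(PROGRAM, PAIRS)
```

Applied to the two experts of Figure 9, the sketch produces the four original rules plus four new ones that match rules 19-22 up to the order of antecedents.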



Theorem 1 

Let 𝒫 be a weighted logic program over a set of predicates ℛ, and let S be a set of
pairs of predicates from 𝒫. Then after applying PRODUCT on (𝒫, S), resulting in a
new program 𝒫′, for every (p, q) ∈ S, the value of p•q(X, Y) in 𝒫′ is p(X) ⊗ q(Y).

Proof: By distributivity of the semiring, we know that p(X) ⊗ q(Y) is the sum
$\bigoplus_{t,r} v(t) \otimes v(r)$, where $t$ and $r$ range over proofs of p(X) and
q(Y) respectively, with their values being $v(t)$ and $v(r)$. This implies that we
need to show that there is a bijection between the set $A$ of proofs for p•q(X, Y)
in 𝒫′ and the set $B$ of pairs of proofs for p(X) and q(Y) such that for every
$s \in A$ and its corresponding pair $(t, r) \in B$ we have
$v(s) = v(t) \otimes v(r)$.

Using structural induction over the proofs, we first show that every pair of proofs
$(t, r) \in B$ has a corresponding proof $s \in A$ with the needed value. In the base
case, where the proofs $t$ and $r$ include a single step, the correspondence follows
trivially. Let $(t, r) \in B$. Without loss of generality, we will assume that both
$t$ and $r$ contain more than a single step in their proofs. In the last step of its
proof, $t$ used a rule of the form

$p(X) \mathrel{\oplus=} a_1(X_1) \otimes \cdots \otimes a_n(X_n)$  (23)

and $r$ used a rule in its last step of the form

$q(Y) \mathrel{\oplus=} b_1(Y_1) \otimes \cdots \otimes b_m(Y_m)$  (24)

Let $t_i$ be the subproofs of $a_i(X_i)$ and let $r_j$ be the subproofs of
$b_j(Y_j)$. It follows that PRODUCT creates from those two rules a single inference
rule of the form:

$p \bullet q(X, Y) \mathrel{\oplus=} c_1(W_1) \otimes \cdots \otimes c_p(W_p)$  (25)

where $c_i(W_i)$ is either $a_k(X_k)$ for some $k$, or $b_k(Y_k)$ for some $k$, or
$a_k \bullet b_\ell(X_k, Y_\ell)$ for some $k, \ell$.

We resolve each case as follows:

1. If $c_i(W_i) = a_k(X_k)$, then we set $s_i = t_k$.

2. If $c_i(W_i) = b_k(Y_k)$, then we set $s_i = r_k$.

3. If $c_i(W_i) = a_k \bullet b_\ell(X_k, Y_\ell)$, then according to the induction
hypothesis, we have a proof for $a_k \bullet b_\ell(X_k, Y_\ell)$ such that its value
is $v(t_k) \otimes v(r_\ell)$. We set $s_i$ to be that proof.

Since we have shown there is a proof for each antecedent of p•q(X, Y), we have
shown that there is a proof for p•q(X, Y). That its value is indeed p(X) ⊗ q(Y)
is concluded trivially from the induction steps.

The reverse direction for constructing the bijection is similar, again using
structural induction over proofs. □

reachable₁•₂(Q₁, Q₂) ⊕= initial₁(Q₁) ⊗ initial₂(Q₂).  (26)
reachable₁•₂(Q₁, Q₂) ⊕= reachable₁•₂(P₁, P₂) ⊗ edge₁(P₁, Q₁) ⊗ edge₂(P₂, Q₂).  (27)

Fig. 12. By removing all but these two rules from the product of experts in
Figure 10, we constrain both paths to have the same number of steps.

reachable₁•₂(Q₁, Q₂) ⊕= initial₁(Q₁) ⊗ initial₂(Q₂) if Q₁ = Q₂.  (28)
reachable₁•₂(Q₁, Q₂) ⊕= reachable₁•₂(P₁, P₂) ⊗ edge₁(P₁, Q₁) ⊗ edge₂(P₂, Q₂) if Q₁ = Q₂.

Fig. 13. By further constraining the program in Figure 12 to demand that Q₁ = Q₂
at all points, we constrain both paths to be identical.

reachable₁•₂(Q) ⊕= initial₁(Q) ⊗ initial₂(Q).  (29)
reachable₁•₂(Q) ⊕= reachable₁•₂(P) ⊗ edge₁(P, Q) ⊗ edge₂(P, Q).  (30)

Fig. 14. We can simplify Figure 13 by internalizing the side condition and giving
reachable₁•₂ only one argument.

4.3 From PRODUCT to a Product of Experts 

The output of the PRODUCT transformation is a starting point for describing dy- 
namic programming algorithms that perform two actions — traversing a graph, scan- 
ning a string, parsing a sentence — at the same time and in a coordinated fashion. 
Exactly what "coordinated fashion" means depends on the problem, and answering 
that question determines how the problem is constrained. 

If we return to the running example of generalized graph reachability, the
program as written has eight rules, four from Figure 9 and four from Figure 10. Two
examples of constrained product programs are given in Figures 12-14. In the first
example, in Figure 12, the only change is that all but two rules have been removed
from the program in Figures 9 and 10. Whereas in the original product program
reachable₁•₂(Q₁, Q₂) corresponded to the product of the weight of the best path
from the initial state of graph one to Q₁ and the weight of the best path from the
initial state of graph two to Q₂, the new program computes the best paths from
the two origins to the two destinations with the additional requirement that the
paths be the same length: the rules that were deleted allowed for the possibility of
a prefix on one path or the other.

If our intent is for the two paths to not only have the same length but to visit
exactly the same sequence of vertices, then we can further constrain the program
to only define reachable₁•₂(Q₁, Q₂) where Q₁ = Q₂, as shown in Figure 13. After
adding this side condition, it is no longer necessary for reachable₁•₂ to have two
arguments that are always the same, so we can simplify further as shown in Figure 14.
For simplicity's sake, we will usually collapse arguments that have been forced by
equality constraints to agree.
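The effect of the two constraints can be checked numerically. The following Python sketch evaluates the unconstrained product, the same-length program of Figure 12, and the identical-path program of Figure 14 in the most-probable-path semiring. The toy graphs, their weights, the dictionary encoding of axioms, and the fixed number of relaxation sweeps are assumptions chosen for illustration only.

```python
def best_paths(initial, edges, sweeps):
    """Rules 15-16 with max as the sum and * as the product."""
    value = dict(initial)
    for _ in range(sweeps):
        for (p, q), w in edges.items():
            if p in value:
                value[q] = max(value.get(q, 0.0), value[p] * w)
    return value


def same_length_product(init1, edges1, init2, edges2, sweeps):
    """Rules 26-27 (Figure 12): both experts take one step at a time."""
    value = {(q1, q2): w1 * w2
             for q1, w1 in init1.items() for q2, w2 in init2.items()}
    for _ in range(sweeps):
        for (p1, q1), w1 in edges1.items():
            for (p2, q2), w2 in edges2.items():
                if (p1, p2) in value:
                    key = (q1, q2)
                    value[key] = max(value.get(key, 0.0),
                                     value[(p1, p2)] * w1 * w2)
    return value


def identical_path_product(init1, edges1, init2, edges2, sweeps):
    """Rules 29-30 (Figure 14): both experts follow the same vertices."""
    value = {q: init1[q] * init2[q] for q in init1 if q in init2}
    for _ in range(sweeps):
        for (p, q), w1 in edges1.items():
            if p in value and (p, q) in edges2:
                value[q] = max(value.get(q, 0.0),
                               value[p] * w1 * edges2[(p, q)])
    return value


# Hypothetical toy graphs (not taken from the paper).
INIT_A = {"a": 1.0}
EDGES_1 = {("a", "b"): 0.5, ("b", "c"): 0.5, ("a", "c"): 0.2}
INIT_X = {"x": 1.0}
EDGES_2 = {("x", "y"): 0.9}
EDGES_3 = {("a", "b"): 0.1, ("b", "c"): 0.1, ("a", "c"): 0.9}

# Unconstrained product: each expert picks its own best path independently.
UNCONSTRAINED = best_paths(INIT_A, EDGES_1, 3)["c"] * best_paths(INIT_X, EDGES_2, 3)["y"]
SAME_LENGTH = same_length_product(INIT_A, EDGES_1, INIT_X, EDGES_2, 3)[("c", "y")]
INDEPENDENT = best_paths(INIT_A, EDGES_1, 3)["c"] * best_paths(INIT_A, EDGES_3, 3)["c"]
IDENTICAL = identical_path_product(INIT_A, EDGES_1, INIT_A, EDGES_3, 3)["c"]
```

In this toy instance the first expert's best path to c has two steps, while the second expert can only take one step, so the same-length program is forced onto the weaker one-step path and scores lower than the unconstrained product. Likewise, requiring identical paths scores lower than letting each expert choose its own favourite route.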

The choice of paired predicates S is important for the final weighted logic pro- 
gram that PRODUCT returns and it also limits the way we can add constraints 
to derive a new weighted logic program. Future research might consider a machine 
learning setting for automatically deriving S from data, to minimize some cost (e.g., 
observed runtime). When PRODUCT is applied on two copies of the same weighted 
logic program (concatenated together to a single program), a natural schema for 
selecting paired predicates arises, in which we pair a predicate from one program 
with the same predicate from the other program. This "natural" pairing leads to
the derivation of several useful, known algorithms, to which we turn in §5.



5 Examples 

In this section, we give several examples of constructing weighted logic programs 
as constrained products of simpler weighted logic programs. 



5.1 Finite-State Algorithms 



We have already encountered weighted finite-state automata (WFSAs) in §3.1. Like
WFSAs, weighted finite-state transducers (WFSTs) are a generalization of the
graph-reachability problem: in WFSAs the edges are augmented with a symbol and
represented as arc(P, Q, A), whereas in WFSTs edges are augmented with a pair of
input-output symbols and represented as arc(P, Q, A, B). Weighted finite-state
machines are widely used in speech and language processing (Mohri 1997; Pereira and
Riley 1997). They are used to compactly encode many competing string hypotheses,
for example in speech recognition, translation, and morphological (word-structure)
disambiguation. Many sequence labeling and segmentation methods can also be
seen as weighted finite-state models.






Shay B. Cohen, Robert J. Simmons, and Noah A. Smith 



goal₁•₂ ⊕= path₁•₂(Q₁, Q₂) ⊗ final₁(Q₁) ⊗ final₂(Q₂).  (31)
path₁•₂(Q₁, Q₂) ⊕= initial₁(Q₁) ⊗ initial₂(Q₂).  (32)
path₁•₂(Q₁, Q₂) ⊕= path₁•₂(P₁, P₂) ⊗ arc₁(P₁, Q₁, A₁) ⊗ arc₂(P₂, Q₂, A₂) if A₁ = A₂.  (33)

Fig. 15. The constrained product of two of the WFSA experts described in Figure 7.

goal ⊕= path(Q) ⊗ final(Q).  (34)
path(Q) ⊕= initial(Q).  (35)
path(Q) ⊕= path(P) ⊗ arc(P, Q, A, B).  (36)

Fig. 16. The weighted logic program describing weighted finite-state transducers.

goal₁•₂ ⊕= path₁•₂(Q₁, Q₂) ⊗ final₁(Q₁) ⊗ final₂(Q₂).  (37)
path₁•₂(Q₁, Q₂) ⊕= initial₁(Q₁) ⊗ initial₂(Q₂).  (38)
path₁•₂(Q₁, Q₂) ⊕= path₁•₂(P₁, P₂) ⊗ arc₁(P₁, Q₁, A₁, B₁) ⊗ arc₂(P₂, Q₂, A₂, B₂) if B₁ = A₂.

Fig. 17. The composition of two weighted finite-state transducers can be derived
by constraining the product of two weighted finite-state transducers.



Weighted finite-state automata. Our starting point for weighted finite-state
automata will be the weighted logic program for WFSAs described in Figure 7, which
is usually interpreted as a probabilistic automaton in the most-probable-path
semiring (i.e., ([0, 1], max, ×, 0, 1)). If the PRODUCT of that algorithm with itself
is taken, we can follow a series of steps similar to the ones described in §4.3.
First, we remove rules that would allow the two WFSAs to consider different
prefixes, and then we add a constraint to rule 33 that requires the two paths'
symbols to be identical. The result is a WFSA describing the (weighted)
intersection of the two WFSAs. The intersection of two WFSAs is itself a WFSA,
though it is a WFSA where states are described by two terms (Q₁ and Q₂ in
path₁•₂(Q₁, Q₂)) instead of a single term.

Weighted intersection generalizes intersection and has a number of uses. For
instance, consider an FSA that is "probabilistic" but that only accepts the single
string "01" because the transitions are all deterministic and have probability 1:




If we consider the program in Figure 15 with axioms describing the above FSA and
the probabilistic FSA given in Figure 6, then the resulting program is functionally
equivalent to the weighted logic program in Figure 8 describing a WFSA specialized
to a particular string. Alternatively, if we consider the program in Figure 15 with
axioms describing the probabilistic FSA in Figure 6 and the following single-state
probabilistic FSA, the result will be a probabilistic FSA biased towards edges with
the "1" symbol and against edges with the "0" symbol.



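[Figure: single-state probabilistic FSA; the diagram is not recoverable from the extraction, only the label fragment "0.1" survives.]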



Both of the above examples can be understood as instances of the product of
experts pattern discussed in §3.2. In the first case, the additional expert eliminates
certain possibilities by assigning zero probability to them, and in the second case
the additional expert merely modifies probabilities by preferring the symbol "1" to
the symbol "0."
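The first of these examples can be sketched directly from the three rules of Figure 15, evaluated in the most-probable-path semiring. The automaton accepting only "01" follows the description above; the second, single-state automaton and all of its weights are made-up stand-ins (Figure 6 is not reproduced here), and the dictionary encoding of the axioms is an assumption of this sketch.

```python
def intersection_best_score(init1, arcs1, final1, init2, arcs2, final2, sweeps):
    """Evaluate goal in Figure 15 with max as the sum and * as the product."""
    # Rule 32: pair up initial states.
    path = {(q1, q2): w1 * w2
            for q1, w1 in init1.items() for q2, w2 in init2.items()}
    # Rule 33: advance both automata on arcs that carry the same symbol.
    for _ in range(sweeps):
        for (p1, q1, a1), w1 in arcs1.items():
            for (p2, q2, a2), w2 in arcs2.items():
                if a1 == a2 and (p1, p2) in path:
                    key = (q1, q2)
                    path[key] = max(path.get(key, 0.0),
                                    path[(p1, p2)] * w1 * w2)
    # Rule 31: both automata must end in a final state.
    return max(
        (w * final1.get(q1, 0.0) * final2.get(q2, 0.0)
         for (q1, q2), w in path.items()),
        default=0.0,
    )


# Expert 1: deterministic automaton accepting only the string "01".
INIT_01 = {"s0": 1.0}
ARCS_01 = {("s0", "s1", "0"): 1.0, ("s1", "s2", "1"): 1.0}
FINAL_01 = {"s2": 1.0}

# Expert 2: hypothetical single-state probabilistic automaton.
INIT_Q = {"q": 1.0}
ARCS_Q = {("q", "q", "0"): 0.3, ("q", "q", "1"): 0.5}
FINAL_Q = {"q": 0.2}

SCORE_01 = intersection_best_score(
    INIT_01, ARCS_01, FINAL_01, INIT_Q, ARCS_Q, FINAL_Q, sweeps=3
)
```

Because the first expert assigns zero weight to everything except "01", the intersection's score is the second expert's score for that one string, which is the sense in which the product specializes a WFSA to a particular string.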



Weighted finite-state transducers. Suppose we take the PRODUCT transformation of
the WFST recognition algorithm in Figure 16 with itself and constrain the result
by removing all but the three interesting rules (as before) and requiring that B₁
(the "output" along the first edge) always be equal to A₂ (the "input" along the
second edge). The result is shown in Figure 17; this is the recognition algorithm
for the WFST resulting from composition of two WFSTs. Composition permits
small, understandable components to be cascaded and optionally compiled, forming
complex but efficient models of string transduction (Pereira and Riley 1997).
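The side condition B₁ = A₂ can be illustrated by building the arcs of the composed transducer explicitly. This is a minimal sketch under assumptions: arcs are stored as a dictionary from (source, target, input, output) to a weight, weights are combined in the most-probable-path semiring, and the two toy transducers are invented for the example.

```python
def compose_arcs(arcs1, arcs2):
    """Pair arcs whose intermediate symbols agree (B1 == A2), keeping A1 and B2."""
    composed = {}
    for (p1, q1, a1, b1), w1 in arcs1.items():
        for (p2, q2, a2, b2), w2 in arcs2.items():
            if b1 == a2:
                key = ((p1, p2), (q1, q2), a1, b2)
                # max plays the role of the semiring sum over intermediate symbols
                composed[key] = max(composed.get(key, 0.0), w1 * w2)
    return composed


# Hypothetical single-state transducers.
ARCS_T1 = {(0, 0, "a", "b"): 0.5, (0, 0, "a", "c"): 0.4}
ARCS_T2 = {(0, 0, "b", "x"): 0.5, (0, 0, "c", "x"): 1.0}
COMPOSED = compose_arcs(ARCS_T1, ARCS_T2)
```

Here both intermediate symbols lead from input "a" to output "x", so the composed transducer has a single arc whose weight is the better of the two routes.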



5.2 Context-Free Parsing 



Parsing natural languages is a difficult, central problem in computational linguistics
(Manning and Schütze 1999). Consider the sentence "Alice saw Bob with binoculars."
One analysis (the most likely in the real world) is that Alice had the binoculars
and saw Bob through them. Another is that Bob had the binoculars, and Alice saw
the binocular-endowed Bob. Figure 18 shows syntactic parses into noun phrases
(NP), verb phrases (VP), etc., corresponding to these two meanings. It also shows
some of the axioms that could be used to describe a context-free grammar describing
English sentences in Chomsky normal form (Hopcroft and Ullman 1979). A
proof corresponds to a derivation of the given sentence in a context-free grammar,
i.e., a parse tree.

Footnote (Chomsky normal form): Chomsky normal form (CNF) means that the rules
in the grammar are either binary with two nonterminals or unary with a terminal.
We do not allow ε rules, which in general are allowed in CNF grammars.

Shieber et al. (1995) show that parsing with CFGs can be formalized as a logic
program, and in Goodman (1999) this framework is extended to the weighted case.
If weights are interpreted as probabilities, then the ([0, 1], max, ×, 0, 1) semiring
interpretation finds the probability of the parse with maximum probability and the
(ℝ≥0 ∪ {∞}, +, ×, 0, 1) semiring interpretation finds the total weight of all parse
trees (a measure of the "total grammaticality" of a sentence). In Figure 19 we give
the specification of the weighted CKY algorithm (Cocke and Schwartz 1970; Kasami
1965; Younger 1967), which is a dynamic programming algorithm for parsing using
a context-free grammar in Chomsky normal form.

[Figure 18, left panel: two parse trees of "Alice saw Bob with binoculars"; the tree drawings are not recoverable from the extraction.]

Grammar rule      Axiom
NP → Alice        unary(np, "Alice")
NP → Bob          unary(np, "Bob")
P → with          unary(p, "with")
S → NP VP         binary(s, np, vp)
VP → V NP         binary(vp, v, np)
PP → P NP         binary(pp, p, np)
NP → NP PP        binary(np, np, pp)

Fig. 18. An ambiguous sentence that can be parsed two ways in English (left),
some of the Chomsky normal form rules for English grammar (center), and the
corresponding axioms (right). There would also need to be five axioms of the form
string(1, "Alice"), string(2, "saw"), etc.

goal₁ ⊕= start₁(S) ⊗ length(N) ⊗ c₁(S, 0, N).  (39)
c₁(X, I−1, I) ⊕= unary₁(X, W) ⊗ string(I, W).  (40)
c₁(X, I, K) ⊕= binary₁(X, Y, Z) ⊗ c₁(Y, I, J) ⊗ c₁(Z, J, K).  (41)

Fig. 19. A weighted logic program for parsing weighted context-free grammars.



Figure 19 suggestively has a subscript attached to all but the length and string
inputs. In our description of the product of experts framework in §3.2, the axioms
length and string correspond to the conditional input sentence I. The unconstrained
result of the PRODUCT transformation on the combination of the rules in
Figure 19 and a second copy that has "2" subscripts is given in Figure 20. Under
the most-probable-path probabilistic interpretation, the value of goal₁•₂ is the
probability of the given string being generated twice, once by each of the two
probabilistic grammars, in each case by the most probable tree in that grammar. By
constraining Figure 20 we get the more interesting program in Figure 21 that adds
the additional requirement that the two parse trees in the two different grammars
have the same structure. In particular, in all cases the constraints I₁ = I₂, J₁ = J₂,
K₁ = K₂, N₁ = N₂ are added, so that instead of writing c₁•₂(X₁, I₁, J₁, X₂, I₂, J₂) we
just write c₁•₂(X₁, X₂, I, J).
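The constrained program of Figure 21 (rules 47-49) can be evaluated bottom-up much like ordinary CKY. The sketch below does so in the most-probable-path semiring. It is illustrative only: the dictionary encoding of the axioms, the choice to give the string and length axioms weight 1, and the two tiny grammars are assumptions of this example rather than anything taken from the paper.

```python
def paired_cky(words, unary1, binary1, start1, unary2, binary2, start2):
    """Best joint score of two same-shaped parses, one per grammar."""
    n = len(words)
    chart = {}  # (X1, X2, I, K) -> best weight of the paired item

    # Rule 48: both grammars rewrite a preterminal to the same word.
    for i, word in enumerate(words, start=1):
        for (x1, w1), p1 in unary1.items():
            for (x2, w2), p2 in unary2.items():
                if w1 == word and w2 == word:
                    key = (x1, x2, i - 1, i)
                    chart[key] = max(chart.get(key, 0.0), p1 * p2)

    # Rule 49: both grammars split the span [I, K] at the same point J.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (x1, y1, z1), p1 in binary1.items():
                    for (x2, y2, z2), p2 in binary2.items():
                        left = chart.get((y1, y2, i, j), 0.0)
                        right = chart.get((z1, z2, j, k), 0.0)
                        if left > 0.0 and right > 0.0:
                            key = (x1, x2, i, k)
                            chart[key] = max(chart.get(key, 0.0),
                                             p1 * p2 * left * right)

    # Rule 47: both start symbols must span the whole sentence.
    return max(
        (s1w * s2w * chart.get((s1, s2, 0, n), 0.0)
         for s1, s1w in start1.items() for s2, s2w in start2.items()),
        default=0.0,
    )


# Hypothetical two-word example with one rule of each kind per grammar.
WORDS = ["a", "b"]
UNARY_1 = {("A", "a"): 1.0, ("B", "b"): 1.0}
BINARY_1 = {("S", "A", "B"): 0.5}
START_1 = {"S": 1.0}
UNARY_2 = {("x", "a"): 0.5, ("y", "b"): 1.0}
BINARY_2 = {("z", "x", "y"): 0.4}
START_2 = {"z": 1.0}

JOINT_SCORE = paired_cky(WORDS, UNARY_1, BINARY_1, START_1,
                         UNARY_2, BINARY_2, START_2)
```

Because the two grammars share the span indices, the chart has the same number of spans as single-grammar CKY; only the nonterminal dimension is squared.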

Footnote (to the description of CKY above): Strictly speaking, the CKY parsing
algorithm corresponds to a naïve bottom-up evaluation strategy for this program.

goal₁•₂ ⊕= length(N₁) ⊗ length(N₂) ⊗ start₁(S₁) ⊗ start₂(S₂) ⊗ c₁•₂(S₁, 0, N₁, S₂, 0, N₂).  (42)
c₁•₂(X₁, I₁−1, I₁, X₂, I₂−1, I₂) ⊕= unary₁(X₁, W₁) ⊗ string(I₁, W₁) ⊗ unary₂(X₂, W₂) ⊗ string(I₂, W₂).  (43)
c₁•₂(X₁, I₁−1, I₁, X₂, I₂, K₂) ⊕= unary₁(X₁, W₁) ⊗ string(I₁, W₁) ⊗ binary₂(X₂, Y₂, Z₂) ⊗ c₂(Y₂, I₂, J₂) ⊗ c₂(Z₂, J₂, K₂).  (44)
c₁•₂(X₁, I₁, K₁, X₂, I₂−1, I₂) ⊕= unary₂(X₂, W₂) ⊗ string(I₂, W₂) ⊗ binary₁(X₁, Y₁, Z₁) ⊗ c₁(Y₁, I₁, J₁) ⊗ c₁(Z₁, J₁, K₁).  (45)
c₁•₂(X₁, I₁, K₁, X₂, I₂, K₂) ⊕= binary₁(X₁, Y₁, Z₁) ⊗ binary₂(X₂, Y₂, Z₂) ⊗ c₁•₂(Y₁, I₁, J₁, Y₂, I₂, J₂) ⊗ c₁•₂(Z₁, J₁, K₁, Z₂, J₂, K₂).  (46)

Fig. 20. The result of the PRODUCT transformation on two copies of Figure 19.

goal₁•₂ ⊕= length(N) ⊗ start₁(S₁) ⊗ start₂(S₂) ⊗ c₁•₂(S₁, S₂, 0, N).  (47)
c₁•₂(X₁, X₂, I−1, I) ⊕= unary₁(X₁, W) ⊗ unary₂(X₂, W) ⊗ string(I, W).  (48)
c₁•₂(X₁, X₂, I, K) ⊕= binary₁(X₁, Y₁, Z₁) ⊗ binary₂(X₂, Y₂, Z₂) ⊗ c₁•₂(Y₁, Y₂, I, J) ⊗ c₁•₂(Z₁, Z₂, J, K).  (49)

Fig. 21. The program in Figure 20 constrained to require internally-identical trees.

Lexicalized CFG parsing. An interesting variant of the previous rule involves
lexicalized grammars, which are motivated in Figure 22. Instead of describing a
grammar using nonterminals denoting phrases (e.g., NP and VP), which is called a
constituent-structure grammar, we can define a (context-free) dependency grammar
(Gaifman 1965) that encodes the syntax of a sentence in terms of parent-child
relationships between words. In the case of the example of Figure 22, the arrows
below the sentence in the middle establish "saw" as the root of the sentence; the
word "saw" has three children (arguments and modifiers), one of which is the word
"with," which in turn has the child "binoculars."

[Figure 22: three analyses of "Alice saw Bob with binoculars" (constituent-structure tree, dependency tree with dependency arrows, lexicalized tree); the drawings are not recoverable from the extraction. Sample rules for each grammar follow.]

Constituent-structure    Dependency            Lexicalized
NP → Alice               Alice → Alice         NP-Alice → Alice
P → with                 with → with           P-with → with
S → NP VP                saw → Alice saw       S-saw → NP-Alice VP-saw
VP → V NP                saw → saw Bob         VP-saw → V-saw NP-Bob

Fig. 22. On the left, the grammar previously shown. In the middle, a context-free
dependency grammar, whose derivations can be seen as parse trees (above) or a
set of dependencies (below). On the right, a lexicalized grammar. Sample rules are
given for each grammar.

A simple kind of dependency grammar is a Chomsky normal form CFG where the
nonterminal set is equivalent to the set of terminal symbols (so that the terminal
"with" corresponds to a unique nonterminal with, and so on) and where all rules
have the form P → P C, P → C P, and W → w (where P is the "parent" word,
C is the "child" word that is dependent on the parent, and W is the nonterminal
version of terminal word w).

If we encode the constituent-structure grammar in the unary₁ and binary₁
relations and encode a dependency grammar in the unary₂ and binary₂ relations,
then the product is a lexicalized grammar, like the third example from Figure 22.
In particular, it describes a lexicalized context-free grammar with a product of
experts probability model (Klein and Manning 2003), because the weight given to
any production A-X → B-X C-Y is the semiring product of the weight given to
the production A → B C and the weight given to the dependency-based production
X → X Y. This was an important distinction for Klein and Manning: they
were interested in factored lexicalized grammars that Figure 21 can describe. These
are only a small (but interesting) subset of all possible lexicalized grammars.
Standard lexicalized CFGs assign weights directly to grammar productions of the form
A-X → B-X C-Y, not indirectly (as we do) by assigning weights to a
constituent-structure and a dependency grammar. We will return to this point in §6.2 when
we consider the "axiom generalization" pattern that allows us to describe general
lexicalized CKY parsing (Eisner 1997; Eisner and Satta 1999).



Nondeterminism and rule binarization. The result of the PRODUCT transformation
shown in Figure 20 was the first time the nondeterminism inherent in lines 6-8 of the
description of the PRODUCT transformation (Figure 11) has come into play. Because
there were two c₁ premises and two c₂ premises, they could have been merged
in more than one way. For example, the following would have been a potential
alternative to rule 46:

c₁•₂(X₁, I₁, K₁, X₂, I₂, K₂) ⊕= binary₁(X₁, Y₁, Z₁) ⊗ binary₂(X₂, Y₂, Z₂) ⊗ c₁•₂(Y₁, I₁, J₁, Z₂, J₂, K₂) ⊗ c₁•₂(Z₁, J₁, K₁, Y₂, I₂, J₂).  (50)

However, this would have broken the correspondence between I₁ and I₂ and made
it impossible to constrain the resulting program as we did. An alternative to CKY
is the binarized variant of CKY where rule 41 is split into two rules by introducing
a new, temporary predicate (rules 51 and 52):

temp₁(X, Y, J, K) ⊕= binary₁(X, Y, Z) ⊗ c₁(Z, J, K).  (51)
c₁(X, I, K) ⊕= c₁(Y, I, J) ⊗ temp₁(X, Y, J, K).  (52)

In this variant, the nondeterministic choice in the PRODUCT transformation
disappears. The choice that we made in pairing was consistent with the choice that is
forced in the binarized CKY program.

goal ⊕= targetlength(M) ⊗ predict(E_{M-1}, E_M, M+1).  (53)
predict(E_{J-1}, E_J, J+1) ⊕= predict(E_{J-2}, E_{J-1}, J) ⊗ trigram(E_{J-2}, E_{J-1}, E_J).  (54)

Fig. 23. A weighted logic program giving a trigram prediction model for a language,
which can be generalized to an n-gram model for any n.

goal ⊕= sourcelength(N) ⊗ trans(N, []).  (55)
trans(I′, Es) ⊕= trans(I, []) ⊗ phrase(I, I′, E_J :: Es).  (56)
trans(I′, Es) ⊕= trans(I′, E_J :: Es).  (57)

Fig. 24. A weighted logic program that describes monotone decoding: translating a
phrase at a time of the input language into the output language without reordering.

goal ⊕= sourcelength(N) ⊗ targetlength(M) ⊗ pr•tr(N, M, E_{M-1}, E_M, M+1).  (58)
pr•tr(I′, J+1, E_{J-1}, E_J, Es) ⊕= pr•tr(I, J, E_{J-2}, E_{J-1}, []) ⊗ trigram(E_{J-2}, E_{J-1}, E_J) ⊗ phrase(I, I′, E_J :: Es).  (59)
pr•tr(I′, J+1, E_{J-1}, E_J, Es) ⊕= pr•tr(I′, J, E_{J-2}, E_{J-1}, E_J :: Es) ⊗ trigram(E_{J-2}, E_{J-1}, E_J).  (60)

Fig. 25. Phrase translation as the constrained product of Figures 23 and 24.



5.3 Translation Algorithms 



Another example of two probabilistic models that play the role of experts arises in
translation of sentences from one natural language to another. We will summarize
how the PRODUCT transformation was applied to a simple form of phrase-to-phrase
translation (Koehn et al. 2003) by Lopez (2009).

Lopez (2009) suggested a deductive view of algorithms for machine translation,
similar to the view of parsing given in Shieber et al. (1995). Lopez used the
PRODUCT transformation to derive an algorithm for phrase translation from two different
factor programs, one which attempts to enforce fluency (a measure of the
grammaticality of a sentence) in the translated sentence and one which attempts to
enforce adequacy (a measure of how much of the meaning of an original sentence is
preserved in the translation).

If fluency is a measure of the grammaticality of a sentence, then it would seem
that the CKY algorithm for parsing context-free grammars would be a candidate.
While such models have been used in translation (Charniak et al. 2003), Lopez's
example uses a simpler notion of fluency based on an n-gram language model
(Manning and Schütze 1999, Chapter 6). An n-gram model assigns the probability of a
sentence to be the product of probabilities of each word following the (n − 1)-word
sequence immediately preceding it. As a concrete example, let us say that n = 3
(called a "trigram" model) and work with the program in Figure 23. If we were
estimating our trigram model based on the relative frequencies of sequences in
Shakespeare's Othello, we would note that the phrase "if it" appears eight times in the
text. Three of these are from the sequence "if it be" and one is from the sequence
"if it prove," so the axiom trigram("if", "it", "be") should have a probability
that is three times the probability given to trigram("if", "it", "prove"). If we
then started the program with the initial sentence fragment predict("if", "it", 3),
we could derive predict("it", "be", 4) with the aforementioned axiom and then
predict("be", "demanded", 5) with the axiom trigram("it", "be", "demanded"), a
sequence occurring once in the text. The result so far is a sequence "if it be
demanded" that does not appear in Othello, but which perhaps sounds like it could
(which is an informal way of describing the criterion for fluency).
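The relative-frequency estimate described above can be sketched as follows. The function name and the toy token sequence are assumptions made for illustration; the tokens are not text from Othello, but they are arranged so that "if it be" occurs three times and "if it prove" once, mirroring the ratio discussed above.

```python
from collections import Counter


def trigram_probabilities(tokens):
    """Relative-frequency estimate of P(third word | first two words)."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    contexts = Counter()
    for (first, second, _), count in trigrams.items():
        contexts[(first, second)] += count
    return {tri: count / contexts[tri[:2]] for tri, count in trigrams.items()}


# Made-up token sequence (not a quotation from the play).
TOY_TEXT = "if it be so if it be not if it be true if it prove false".split()
TRIGRAM = trigram_probabilities(TOY_TEXT)
```

Each entry of the resulting dictionary would become the weight of one trigram axiom in the program of Figure 23.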

The weighted logic program Lopez uses to enforce adequacy is the "monotone
decoding" logic program presented in Figure 24. The program is slightly contrived
in order to interact with the PRODUCT transformation correctly. The atomic
proposition trans(I, Es) refers to a particular point, I, in the source-language string
and a list Es of unprocessed words in the target language. Each deduction
consumes a single word (E_J) in the target language; indeed, this is the only function of
rule 57. When there are no words to remove, then either the entire source-language
string has been translated (rule 55), or else progress can continue by translating
some chunk of the source-language sentence starting from position I and ending
at position I′ as the non-empty list of target-language words E_J :: Es and applying
rule 56. This translation of a sequence of the source-language words is captured
by the premise phrase(I, I′, Es), corresponding to the source subsequence from
position I to position I′ being translated as Es (a target-language phrase). The
meaning of phrase could be defined by a set of axioms or by a rule. In the latter
case, if we enumerate all the substrings Ds in the source-language sentence as
axioms substr(I, I′, Ds) and provide axioms ptranslate(Ds, Es) describing
source-language to target-language phrase translation, then phrase(I, I′, Es) may be
defined by the following rule:

phrase(I, I′, Es) ⊕= substr(I, I′, Ds) ⊗ ptranslate(Ds, Es).  (61)

Note that substr might be provided as an axiom, or derived from axioms encoding
the source sentence through another inference rule.

Figure 25 displays Lopez's phrase translation program, obtained by constraining the
product of the n-gram model and monotone decoding programs. Lopez describes this for
any n, but for simplicity we continue using a trigram model (n = 3). The combined
predicate simultaneously tracks a position in the source-language sentence I and
the target-language sentence J. The word E_J that was discarded at each step in
Figure 24 is given relevance by the trigram model. The combination of these two
programs uses the monotone decoding program's capabilities to make sure that the
phrase-by-phrase meaning of the source-language string D_1 … D_N is preserved in
the destination language string E_1 … E_M (adequacy) while simultaneously using
the trigram model's capabilities to ensure that the result is a plausible sentence in
the destination language (fluency).

Footnote (list notation): We use a standard syntactic shorthand for lists; "[]" can
be read as the constant nil and "E :: Es" can be read as cons(E, Es).

Our presentation of machine translation algorithms through the PRODUCT
transformation is simplistic. Lopez (2009) discusses more powerful translation algorithms
that permit, for example, reordering of phrases.



6 Variations on PRODUCT 

Up to this point, we have viewed our use of the PRODUCT transformation as one that 
solves a problem of joint optimization: we take two logic programs that describe 
structures (such as strings, paths, or trees), relate them to one another by adding 
constraints, and then optimize over the two original structures simultaneously (one 
instance of this is when we use weighted logic programming to describe a product 
of experts.) This is a useful pattern, but it is not the only interesting use of the 
fold/unfold transformation underlying the PRODUCT transformation. In this section 
we consider two other variants: in the first we only optimize over one of the two 
structures and fix the other one, and in the second we take the output of PRODUCT as 
describing not joint optimization over two simple structures but over one complex 
structure. 



6.1 Fixing One of the Factor Structures 

The usual use of the PRODUCT transformation is to perform joint optimization on
two structures, but general side conditions can be used to take the additional step
of fixing one of the two structures and having the weighted logic program perform
optimization on the other structure, subject to constraints imposed through the pairing.

In the setting where we consider weights to be probabilities, this is useful for
solving certain probabilistic inference problems. Using the path-sum semiring (i.e.,
(ℝ≥0 ∪ {∞}, +, ×, 0, 1)), the result is a program calculating the marginalized
quantity $p(x) = \sum_y p(x, y)$ (where x corresponds to one program's proof and y to
the other program's proof). This is a useful quantity in learning; for example,
the expectation-maximization (EM) algorithm (Dempster et al. 1977) for
optimizing the marginalized log-likelihood of observed structures requires
calculating sufficient statistics which are based on marginal quantities. Using the
most-probable-path semiring (i.e., ([0, 1], max, ×, 0, 1)), the result is a program for
solving $\arg\max_y p(y \mid x)$, that is, for finding the most probable y given the fixed x.

The transformation of the constrained result of the PRODUCT transformation to a 
program with one proof fixed is essentially mechanical. We consider the example of 
lexicalized parsing from Figure [22j We take the constituent-structure parse as the 
structure we want to fix in order optimize over the possible matching parses from the 
dependency grammar. The shape of the constituent-structure parse tree can be rep- 
resented by a series of new axioms that mirror the structure of the ci(X, I, J) pred- 
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pathi.2(Qi,Q2) e= initiali.2(Qi,Q2). (63) 
pathi.2(Qi,Q2) e= pathi.2(Pi,P2) ® arci.2(Pi,P2,Qi,Q2,Ai,A2). (64) 



Fig. 26. Weighted finite-state transducers as the product of two weighted finite- 
state machines. 




Fig. 27. A finite-state transducer that can be expressed as the PRODUCT of two 
finite-state automata. 



icate defining the constituent-structure grammar: proof i(s, 0, 5), proof i(np,0, 1), 
proof i(vp, 1, 5), proof i(vp, 1, 3), proof i(pp, 3, 5), and so on. 

Then we take the constrained PRODUCT of CKY that we used to describe lexical- 



ized parsing (Figure 21 1 and, wherever there was a conclusion derived from ci, we 



add a matching side condition that references proof i. Tlie critical rule (49) ends 
up looking like this: 

Ci.2(Xi,X2,I,K) e= binaryi(Xi,Yi,Zi) ® binary2(X2,Y2,Z2) ® (62) 
Ci.2(Yi,Y2,I, J) ® Ci.2(Zi,Z2, J,K) if proofi(Xi,I,K). 



The effect of this additional constraint is to disqualify any proof that does not
match the constituent-structure grammar which we have fixed and encoded as
$\mathrm{proof}_1$ axioms. The idea of partially constraining CFG derivations with
some bracketing structure was explored by Pereira and Schabes (1992).
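The effect of such a side condition can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it runs ordinary (single-grammar) CKY in the most-probable-path semiring rather than the full PRODUCT program, and the names `cky_best`, `allowed`, and the toy grammar are hypothetical, not taken from the paper. The set `allowed` plays the role of the fixed bracketing axioms: an item is derived only if its label and span appear in it.

```python
from collections import defaultdict

def cky_best(words, unary, binary, start, allowed=None):
    """Viterbi CKY in the (max, *) semiring.

    unary:   dict (X, word) -> weight
    binary:  dict (X, Y, Z) -> weight
    allowed: optional set of (X, i, k) items; when given, it acts as a
             side condition that blocks every item not listed in it.
    """
    n = len(words)
    chart = defaultdict(float)  # missing items have weight 0.0

    def ok(x, i, k):
        return allowed is None or (x, i, k) in allowed

    # Unary rules: c(X, I-1, I) from unary(X, W) and string(I, W).
    for i, w in enumerate(words):
        for (x, word), wt in unary.items():
            if word == w and ok(x, i, i + 1):
                chart[x, i, i + 1] = max(chart[x, i, i + 1], wt)

    # Binary rules: c(X, I, K) from binary(X, Y, Z), c(Y, I, J), c(Z, J, K).
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (x, y, z), wt in binary.items():
                    if not ok(x, i, k):
                        continue
                    score = wt * chart[y, i, j] * chart[z, j, k]
                    chart[x, i, k] = max(chart[x, i, k], score)
    return chart[start, 0, n]

# Hypothetical toy grammar with two bracketings of "x y z".
unary = {("A", "x"): 1.0, ("B", "y"): 1.0, ("D", "z"): 1.0}
binary = {("S", "A", "E"): 0.7, ("E", "B", "D"): 1.0,
          ("S", "F", "D"): 0.3, ("F", "A", "B"): 1.0}
words = ["x", "y", "z"]

free = cky_best(words, unary, binary, "S")
# Fix the left-branching bracketing ((x y) z).
left = {("A", 0, 1), ("B", 1, 2), ("D", 2, 3), ("F", 0, 2), ("S", 0, 3)}
fixed = cky_best(words, unary, binary, "S", allowed=left)
```

Without the constraint the right-branching parse wins; with it, only the proof compatible with the fixed bracketing survives.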



6.2 Axiom Generalization 

Axiom generalization is another way of manipulating products of weighted logic
programs in a way that reveals the simple structures underlying a complex
structure. Figure 26, which is intended to describe a weighted finite-state
transducer, is close to the weighted logic program in Figure 15 that describes the
intersection of two finite-state machines, but there are two differences. First,
we have not forced the two symbols to be the same; instead, we wish to interpret
$A_1$ from the first expert as the transducer's input symbol and $A_2$ as the
transducer's output symbol. Second, we have merged
$\mathrm{initial}_1(Q_1) \otimes \mathrm{initial}_2(Q_2)$ to the single product
predicate $\mathrm{initial}_{1\bullet 2}(Q_1, Q_2)$, and likewise for arc. As a
first approximation, we can just define $\mathrm{arc}_{1\bullet 2}$ (and,
similarly, $\mathrm{initial}_{1\bullet 2}$) by a single rule of this form:

\[
\mathrm{arc}_{1\bullet 2}(P_1, P_2, Q_1, Q_2, A_1, A_2) \oplus= \mathrm{arc}_1(P_1, Q_1, A_1) \otimes \mathrm{arc}_2(P_2, Q_2, A_2) \tag{65}
\]

An example is given in Figure 27. Two finite-state machines, one with two states
(a and b) and one with three states (x, y, and z), are shown. We are working over
the Boolean semiring, so each arc in the figure corresponds to a true-valued arc
axiom. The PRODUCT of these two experts in the manner of Figure 26 is a single
finite-state transducer with six states.
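The construction can be sketched as follows in the Boolean semiring, where a product arc holds exactly when both component arcs hold. The arcs below are hypothetical (the arcs actually drawn in the figure are not reproduced here), and `product_arcs` is an illustrative name.

```python
from itertools import product

def product_arcs(arcs1, arcs2):
    """Pair every arc (P1, Q1, A1) with every arc (P2, Q2, A2).

    In the Boolean semiring the product arc is true iff both arcs are
    true, so the result is simply the set of all pairings.
    """
    return {((p1, p2), (q1, q2), (a1, a2))
            for (p1, q1, a1), (p2, q2, a2) in product(arcs1, arcs2)}

# Hypothetical machines: two states (a, b) and three states (x, y, z).
arcs1 = {("a", "b", 0), ("b", "a", 1)}
arcs2 = {("x", "y", 0), ("y", "z", 1), ("z", "x", 0)}

arcs12 = product_arcs(arcs1, arcs2)
states12 = set(product("ab", "xyz"))  # the six paired states
```

Each product arc reads the first machine's symbol as input and emits the second machine's symbol as output.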

However, we can only describe a certain subset of finite-state transducers as 
the direct product of finite-state machines in this way. If we consider all possible 
Boolean-valued finite-state transducers with two symbols and one state, we have 
16 possible transducers, but only 10 that can be "factored" as two independent 
finite-state machines, such as these three: 

[Figure: three example one-state transducers, each shown as the product of two
one-state finite-state machines.]

Six others, like the NOT transducer that outputs 1 given the input 0 and outputs
0 given the input 1, cannot be represented as the product of two FSMs.
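The counts can be checked by brute force. The sketch below assumes the encoding suggested by the text: a one-state Boolean transducer over the symbols {0, 1} is identified with its set of (input, output) arc labels, and a one-state machine with its set of symbols. The names are illustrative.

```python
from itertools import combinations, product

SYMBOLS = (0, 1)
PAIRS = list(product(SYMBOLS, SYMBOLS))  # the four possible arc labels

def subsets(items):
    """All subsets of items, as frozensets."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

# Every one-state transducer: any subset of the four (input, output) arcs.
transducers = subsets(PAIRS)

# Factorable transducers: arc sets of the form A x B for symbol sets A, B.
factorable = {frozenset(product(a, b))
              for a in subsets(SYMBOLS) for b in subsets(SYMBOLS)}

NOT = frozenset({(0, 1), (1, 0)})
```

The empty transducer arises from several different factorizations, which is why there are 10 rather than 16 distinct products.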

In many settings, limiting ourselves to the "factorable" finite-state transducers
(or lexicalized grammars) can have conceptual or computational advantages. When
this does not suffice, we can perform axiom generalization, which amounts to
removing the requirement of Eq. 65 that the value of atomic propositions of the
form $\mathrm{arc}_{1\bullet 2}$ be the product of an atomic proposition of the
form $\mathrm{arc}_1$ and an atomic proposition of the form $\mathrm{arc}_2$. If
we directly define axioms of the form $\mathrm{arc}_{1\bullet 2}$, we can describe
transducers in their full generality.

This represents a new way of thinking about the PRODUCT transformation. Thus 
far, we have considered the result of the PRODUCT transformation as a way of 
describing programs that work over two different structures. Axiom generalization 
suggests that we can consider the PRODUCT transformation as a way of taking 
two programs that work over individual structures and deriving a new program 
that works over a single more complicated structure that, in special cases, can be 
factored into two different structures. This is particularly relevant in the area of 
lexicalized grammars and parsing where the general, more complicated structure is 
what came first and the factored models which we have considered thus far arose 
later as special cases. 

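Parsing algorithms and the PRODUCT transformation. Many parsing algorithms can
be derived by using the PRODUCT transformation as a way of deriving programs that
do not neatly factor into two parts. Lexicalized parsing is a simple example;
Figure 28 derives a lexicalized parser by performing axiom generalization on
Figure 21. The grammar production P-with $\to$ with can be represented by
including the axiom $\mathrm{unary}_{1\bullet 2}(p, \text{"with"})$, and the
binary production S-saw $\to$ NP-Alice VP-saw can be represented by the axiom
$\mathrm{binary}_{1\bullet 2}(s, \text{"saw"}, np, \text{"alice"}, vp, \text{"saw"})$.

Synchronous grammars are another instance in which the axiom generalization
view is interesting. A synchronous grammar can be thought of as parsing two
different sentences in two different languages with two different grammars using a
single parse tree. For example, if $X \to Y\,Z$ is a grammar production in one
language and $A \to B\,C$ is a grammar production in another language, then
X-A $\to$ Y-B Z-C is a possible grammar production in the synchronous grammar.

\begin{align}
\mathrm{goal}_{1\bullet 2} &\oplus= \mathrm{length}(N) \otimes \mathrm{start}(S) \otimes \mathrm{c}_{1\bullet 2}(S, W, 0, N). \tag{66}\\
\mathrm{c}_{1\bullet 2}(X, W, I-1, I) &\oplus= \mathrm{unary}_{1\bullet 2}(X, W) \otimes \mathrm{string}(I, W). \tag{67}\\
\mathrm{c}_{1\bullet 2}(X, W, I, K) &\oplus= \mathrm{binary}_{1\bullet 2}(X, W, Y, W_1, Z, W_2) \otimes \mathrm{c}_{1\bullet 2}(Y, W_1, I, J) \otimes \mathrm{c}_{1\bullet 2}(Z, W_2, J, K). \tag{68}
\end{align}

Fig. 28. An algorithm for CKY over a general lexicalized grammar derived from
Figure 21 by axiom generalization.

\begin{align}
\mathrm{goal} &\oplus= \mathrm{length}(N) \otimes \mathrm{start}(S) \otimes \mathrm{c}(S, 0, N). \tag{69}\\
\mathrm{c}(X, I, I) &\oplus= \mathrm{unary}(X, \epsilon) \otimes \mathrm{pos}(I). \tag{70}\\
\mathrm{c}(X, I-1, I) &\oplus= \mathrm{unary}(X, W) \otimes \mathrm{string}(I, W). \tag{71}\\
\mathrm{c}(X, I, K) &\oplus= \mathrm{binary}(X, Y, Z) \otimes \mathrm{c}(Y, I, J) \otimes \mathrm{c}(Z, J, K). \tag{72}
\end{align}

Fig. 29. A variant of CKY that handles grammar productions of the form
$X \to \epsilon$.

\begin{align}
\mathrm{goal}_{1\bullet 2} &\oplus= \mathrm{length}_1(N) \otimes \mathrm{length}_2(M) \otimes \mathrm{start}_{1\bullet 2}(S) \otimes \mathrm{c}_{1\bullet 2}(S, 0, N, 0, M). \tag{73}\\
\mathrm{c}_{1\bullet 2}(X, I-1, I, J, J) &\oplus= \mathrm{unary}_{1\bullet 2}(X, W_1, \epsilon) \otimes \mathrm{string}_1(I, W_1) \otimes \mathrm{pos}_2(J). \tag{74}\\
\mathrm{c}_{1\bullet 2}(X, I, I, J-1, J) &\oplus= \mathrm{unary}_{1\bullet 2}(X, \epsilon, W_2) \otimes \mathrm{pos}_1(I) \otimes \mathrm{string}_2(J, W_2). \tag{75}\\
\mathrm{c}_{1\bullet 2}(X, I-1, I, J-1, J) &\oplus= \mathrm{unary}_{1\bullet 2}(X, W_1, W_2) \otimes \mathrm{string}_1(I, W_1) \otimes \mathrm{string}_2(J, W_2). \tag{76}\\
\mathrm{c}_{1\bullet 2}(X, I_1, K_1, I_2, K_2) &\oplus= \mathrm{binary}_{1\bullet 2}(X, Y, Z) \otimes \mathrm{c}_{1\bullet 2}(Y, I_1, J_1, I_2, J_2) \otimes \mathrm{c}_{1\bullet 2}(Z, J_1, K_1, J_2, K_2). \tag{77}
\end{align}

Fig. 30. A simple transduction grammar derived from Figure 29.

\begin{align}
\mathrm{c}_{1\bullet 2}(X, I_1, K_1, I_2, K_2) &\oplus= \mathrm{inversion}_{1\bullet 2}(X, Y, Z) \otimes \mathrm{c}_{1\bullet 2}(Y, I_1, J_1, J_2, K_2) \otimes \mathrm{c}_{1\bullet 2}(Z, J_1, K_1, I_2, J_2). \tag{78}
\end{align}

Fig. 31. By adding to Figure 30 this rule, corresponding to the other way that
the $\mathrm{c}_1$ and $\mathrm{c}_2$ antecedents may be merged in the PRODUCT
transformation, we can describe an inversion transduction grammar.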

A transduction grammar (Wu 1997) is a synchronous grammar which generates
two isomorphic derivations with a trivial alignment between the nodes of those two
derivations. We can describe a parser for a transduction grammar with the program
in Figure 30. Synchronous grammars need to be able to deal with situations in
which a word in one language does not appear in the matching sentence in the other
language; this is done by starting from the enriched CKY program in Figure 29
that can handle grammar productions of the form $X \to \epsilon$.

In practice, transduction grammars do a bad job of aligning two sentences in
different natural languages that are translations of each other, because it is
often the case that two parts of a pair of sentences need to be in opposite
positions relative to one another: in language one, the verb phrase might precede
a prepositional phrase, and in language two, the corresponding verb phrase might
follow the corresponding prepositional phrase. An inversion transduction grammar
describes an alternate form of grammar production, which Wu (1997) writes as
$X \to \langle Y\,Z \rangle$. This grammar production declares that if $A_1$ and
$A_2$ simultaneously parse as $Y$ in languages one and two (respectively) and
$B_1$ and $B_2$ simultaneously parse as $Z$ in languages one and two
(respectively), then $A_1 B_1$ and $B_2 A_2$ simultaneously parse as $X$.

Somewhat surprisingly, this inversion production rule can be described using the
alternate allowable way of merging the premises when the PRODUCT transformation
is performed on two copies of the CKY algorithm, as discussed in §5.2 (see rule
50). By adding this alternate form as given in Figure 31, we can describe the
algorithm for parsing with inversion transduction grammars described by Wu (1997).
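The difference between the two ways of merging the premises comes down to how the second-language indices are threaded through the antecedents. The sketch below looks only at the index bookkeeping and ignores labels and weights; items are written as (I1, K1, I2, K2) spans, and the function names are hypothetical.

```python
def combine_straight(y_span, z_span):
    """Straight merge: Y covers (I1, J1) and (I2, J2),
    Z covers (J1, K1) and (J2, K2)."""
    i1, j1, i2, j2 = y_span
    j1b, k1, j2b, k2 = z_span
    if (j1, j2) == (j1b, j2b):
        return (i1, k1, i2, k2)
    return None

def combine_inverted(y_span, z_span):
    """Inverted merge: Y covers (I1, J1) and (J2, K2),
    Z covers (J1, K1) and (I2, J2)."""
    i1, j1, j2, k2 = y_span
    j1b, k1, i2, j2b = z_span
    if (j1, j2) == (j1b, j2b):
        return (i1, k1, i2, k2)
    return None

# Y is first in language one but second in language two; Z is the reverse.
y = (0, 1, 1, 2)
z = (1, 2, 0, 1)
straight = combine_straight(y, z)
inverted = combine_inverted(y, z)
```

Only the inverted merge can join these two items into a constituent spanning both sentences.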



7 The Entropy Semiring and Kullback-Leibler Divergence 



An important construct in information theory and machine learning is the
Kullback-Leibler (KL) divergence (Kullback and Leibler 1951). KL divergence is a
function of two probability distributions over the same event space. It measures
their dissimilarity, though it is not, strictly speaking, a distance (it is not
symmetric). For two distributions $p$ and $q$ for random variable $X$ ranging
over events $x \in \mathcal{X}$, KL divergence is defined as



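\begin{align}
\mathrm{KL}(p \,\|\, q) &= \sum_{x \in \mathcal{X}} p(X = x) \log \frac{p(X = x)}{q(X = x)} \tag{79}\\
&= \underbrace{\sum_{x \in \mathcal{X}} p(X = x) \log p(X = x)}_{-H(p)} \; \underbrace{{} - \sum_{x \in \mathcal{X}} p(X = x) \log q(X = x)}_{\mathrm{CE}(p \| q)} \tag{80}
\end{align}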

where $H(p)$ denotes the Shannon entropy of the distribution $p$ (Shannon 1948), a
measure of uncertainty, and $\mathrm{CE}(p \| q)$ denotes the cross-entropy
between $p$ and $q$. A full discussion of these information-theoretic quantities
is out of scope for this paper; we note that they are widely used in statistical
machine learning (Koller and Friedman 2009). In this section, we first show how
the entropy of $p(P)$, with $P$ ranging over proofs of goal (the axioms
corresponding to random variables $A$ and $I$ are suppressed here, for clarity),
can be calculated using a weighted logic program, following Hwa (2004). We then
describe a generalization of a result of Cortes et al. (2006) to show how to use
PRODUCT to produce a weighted logic program for calculating the KL divergence
between the two distributions induced by the WLPs.

Footnote: In brief, the Shannon entropy of distribution $p$ is the expected number
of bits required to send a message drawn according to $p$ under an optimal coding
scheme. Cross-entropy is the average number of bits required to encode a message
in the optimal coding scheme for $q$ when messages are actually distributed
according to $p$. Hence $\mathrm{KL}(p \| q) = \mathrm{CE}(p \| q) - H(p)$ is the
average number of extra bits required when the true distribution of messages is
$p$ but the coding scheme is based on $q$. Note that $\mathrm{KL}(p \| q) \geq 0$,
with $\mathrm{KL}(p \| p) = 0$, and that it is possible that
$\mathrm{KL}(p \| q) = +\infty$.
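For a finite event space these quantities are easy to compute directly. The following is a minimal sketch, measuring in bits (base-2 logarithms) and representing a distribution as a list of probabilities; the function names and example distributions are illustrative only.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits; terms with p(x) = 0 contribute 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """CE(p || q) in bits; infinite if q(x) = 0 for some x with p(x) > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total -= pi * math.log2(qi)
    return total

def kl(p, q):
    """KL(p || q) = CE(p || q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

p = [0.5, 0.5]
q = [0.25, 0.75]
extra_bits = kl(p, q)  # cost of coding for q when messages follow p
```

Swapping the arguments generally gives a different value, which is the asymmetry noted above.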



7.1 Generalized Entropy Semiring 

The domain of the generalized entropy semiring is
$(\mathbb{R} \cup \{+\infty, -\infty\})^3$. The multiplication and addition
operations are defined as follows:

\begin{align}
\langle x_1, y_1, z_1 \rangle \oplus \langle x_2, y_2, z_2 \rangle &= \langle x_1 + x_2,\; y_1 + y_2,\; z_1 + z_2 \rangle \tag{81}\\
\langle x_1, y_1, z_1 \rangle \otimes \langle x_2, y_2, z_2 \rangle &= \langle x_1 x_2,\; x_1 y_2 + x_2 y_1,\; z_1 z_2 \rangle \tag{82}
\end{align}

These operations have the required closure, associativity, and commutativity
properties previously discussed for semirings. See Cortes et al. (2006) for a
proof which can be extended trivially to our generalized semiring.

Suppose we have a weighted logic program such that the path-sum (in the
$\langle \mathbb{R}_{\geq 0} \cup \{\infty\}, +, \times, 0, 1 \rangle$ semiring)
is 1 (i.e., the value of the goal theorem is 1). If we map the weights of all
axioms in the original program to new values in the generalized entropy semiring,
we can use the new semiring to calculate the Shannon entropy of the distribution
over proofs of goal:

\[
-\sum_{\mathit{proof}} p(P = \mathit{proof}) \log p(P = \mathit{proof}) \tag{83}
\]

where $\mathit{proof}$ ranges over proofs of goal. The mapping is simply
$w \mapsto \langle w, -w \log w, 0 \rangle$. (The third element of the semiring
value is not needed here.) If we solve the new weighted logic program and achieve
value $\langle w', h', 0 \rangle$ for the goal theorem, then under our assumption
that $w' = 1$ (the value of goal in the original program in the real semiring),
$h'$ is the entropy of the distribution over the proof random variable (given the
axioms and goal). The formal result is given as a corollary in §7.2.
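As an informal check of why this mapping works (a worked step added for illustration, not the formal result), consider a proof that uses two axioms with weights $w_1$ and $w_2$. Multiplying their mapped values with the $\otimes$ of Eq. 82 gives

```latex
\begin{align*}
\langle w_1, -w_1 \log w_1, 0 \rangle \otimes \langle w_2, -w_2 \log w_2, 0 \rangle
  &= \langle w_1 w_2,\; -w_1 w_2 \log w_2 - w_2 w_1 \log w_1,\; 0 \rangle \\
  &= \langle w_1 w_2,\; -(w_1 w_2) \log (w_1 w_2),\; 0 \rangle .
\end{align*}
```

So a proof with score $u = w_1 w_2$ carries the value $\langle u, -u \log u, 0 \rangle$, and the componentwise $\oplus$ of Eq. 81 then sums $-u \log u$ over proofs in the second coordinate.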

This semiring can be used, for example, with the CKY algorithm from Figure 19. It
makes the derivation of the tree entropy for context-free grammars (i.e., the
entropy over the context-free derivations for an ambiguous string) automatic, and
obviates the design of a specific algorithm for computing the tree entropy for
probabilistic context-free grammars, as described in Hwa (2004). With the CKY
algorithm, each proof in the entropy sum above represents a derivation in the
grammar. Similarly, a weighted logic program describing a finite-state transducer
(Figure 16) can be used to compute the entropy of hidden sequences for hidden
Markov models as described by Hernando et al. (2005).



We now relax the assumption that the sum of all proof scores is 1. Suppose that
the value of the goal theorem in the generalized entropy semiring is
$\langle w', h', 0 \rangle$, with $w' \neq 1$. In this case, $h'$ is not the
entropy of a proper probability distribution. We can renormalize the scores of
the proofs, $u(\mathit{proof})$, by dividing by $w'$, treating them as a proper
conditional distribution (conditioning on the truth of the goal theorem); then
the entropy of this conditional distribution, $u(\mathit{proof}) / w'$, is

\begin{align*}
& -\sum_{\mathit{proof}} \frac{u(\mathit{proof})}{w'} \log \frac{u(\mathit{proof})}{w'} \\
&\quad = -\frac{1}{w'} \sum_{\mathit{proof}} u(\mathit{proof}) \left( \log u(\mathit{proof}) - \log w' \right) \\
&\quad = \frac{1}{w'} \left( h' + (\log w') \sum_{\mathit{proof}} u(\mathit{proof}) \right)
 = \frac{1}{w'} \left( h' + w' \log w' \right) = \frac{h'}{w'} + \log w'
\end{align*}

Therefore, whenever we can use weighted logic programming (in the real semiring)
to calculate sums of proof scores, we can use the generalized entropy semiring to
find the Shannon entropy of the (possibly renormalized) distribution over proofs.
The renormalization uses $w'$ and $h'$, two quantities that are calculated
directly when we use the generalized entropy semiring.
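The whole procedure can be sketched on a toy example. The code below is an illustration under stated assumptions, not an implementation of a weighted logic programming engine: proofs are listed explicitly as lists of strictly positive axiom weights (a real program would share subproofs through dynamic programming), natural logarithms are used, and all names are hypothetical.

```python
import math
from functools import reduce

def splus(a, b):
    """Semiring addition: componentwise sum."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def stimes(a, b):
    """Semiring multiplication: (x1*x2, x1*y2 + x2*y1, z1*z2)."""
    return (a[0] * b[0], a[0] * b[1] + b[0] * a[1], a[2] * b[2])

def lift(w):
    """Map an axiom weight w > 0 to (w, -w log w, 0)."""
    return (w, -w * math.log(w), 0.0)

def goal_value(proofs):
    """Sum over proofs of the product of lifted axiom weights."""
    return reduce(splus, (reduce(stimes, map(lift, proof)) for proof in proofs))

def entropy_from_semiring(proofs):
    """Entropy of the renormalized proof distribution: h'/w' + log w'."""
    w, h, _ = goal_value(proofs)
    return h / w + math.log(w)

def entropy_direct(proofs):
    """Reference value: normalize proof scores and apply the definition."""
    scores = [math.prod(proof) for proof in proofs]
    total = sum(scores)
    return -sum(s / total * math.log(s / total) for s in scores)

# Three hypothetical proofs; their scores sum to 0.65, not 1.
proofs = [[0.5, 0.4], [0.5, 0.6, 0.5], [0.3]]
```

Because the proof scores here do not sum to one, the agreement between the two functions exercises the renormalization step as well as the semiring operations.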



7.2 KL Divergence Between Proof Distributions and PRODUCT 



Cortes et al. (2006) showed how to compute the KL divergence (also called relative
entropy) between two distributions over strings defined by probabilistic FSA, using
a construct similar to our generalized entropy semiring. We generalize that result
to KL divergence over two proof distributions $p(P)$ and $q(P)$ given by a weighted
logic program $\mathcal{P}$. We assume in this discussion that the set of axioms
with non-zero weights is identical under $p$ and under $q$; the general setting
where this does not hold is correctly handled, using $a \log 0 = -\infty$ and
$0 \log a = 0$ for all $a > 0$.

We abuse notation slightly and use $p$ and $q$ to denote the values of axioms,
theorems, and proofs in the real semiring weighted logic programs used to calculate
the sum of proof-scores for goal under axioms weighted according to $p$ and $q$.
Let $\mathrm{Proofs}(t)$ denote the set of logical proofs of a theorem $t$, and for
$x \in \mathrm{Proofs}(t)$, let $p(x)$ (respectively, $q(x)$) denote the score of
the proof $x$:

\begin{align}
p(t) &= \sum_{x \in \mathrm{Proofs}(t)} p(x) \tag{85}\\
q(t) &= \sum_{x \in \mathrm{Proofs}(t)} q(x) \tag{86}
\end{align}

We seek the KL divergence:

\[
\mathrm{KL}(p \| q) = \sum_{x \in \mathrm{Proofs}(\mathrm{goal})} p(x) \log \frac{p(x)}{q(x)} \tag{88}
\]

In order to accomplish this calculation, we will first map the weights of axioms
under $p$ and $q$ into the generalized entropy semiring as follows, for any axiom
$a$:

\[
\langle p(a), q(a) \rangle \mapsto \langle p(a),\; p(a) \log q(a),\; q(a) \rangle \tag{89}
\]

For a theorem $t$, let

\[
R(t) = \sum_{x \in \mathrm{Proofs}(t)} p(x) \log q(x) \tag{90}
\]

Theorem 2

Solving $\mathcal{P}$ in the generalized entropy semiring with weights defined as
above results in goal having value
$\langle p(\mathrm{goal}), R(\mathrm{goal}), q(\mathrm{goal}) \rangle$.

Proof: We will treat the weighted logic program as a set of equations with all
left-hand-side variables grounded. We will use upper-case to refer to free
variables (e.g., $Z = \langle Z_1, \ldots \rangle$) and lower-case to refer to
grounded values (e.g., $z = \langle z_1, \ldots \rangle$). The range of values
that variables $Z$ can get is denoted by $\mathrm{Rng}(Z)$. The weighted logic
program can be seen as a set of equations:

\[
c(w) = \bigoplus_{\substack{[c(w) \,\oplus=\, a_i(w', Z) \otimes b_i(w'', Z)] \in \mathcal{P}, \\ w' \subseteq w,\; w'' \subseteq w}} \;\; \bigoplus_{z \in \mathrm{Rng}(Z)} a_i(w', z) \otimes b_i(w'', z) \tag{91}
\]

(Note that any of $w$, $w'$, $w''$, and $z$ may be empty.)
We now show that the value achieved for $c(w)$ when solving in the semiring is

\[
\left\langle p(c(w)),\; \sum_{x \in \mathrm{Proofs}(c(w))} p(x) \log q(x),\; q(c(w)) \right\rangle \tag{92}
\]

where $\mathrm{Proofs}(c(w))$ denotes the set of proofs for $c(w)$. We will show
that the solution of Equations 91 is the value in Equation 92 for $c(w)$.

For the first and third coordinates, this equality follows naturally because of the
definition of the generalized entropy semiring: the first and third coordinates are
equivalent to the non-negative real semiring used for summing over proof scores
under the two value assignments $p$ and $q$, respectively.

Consider a particular $\oplus$-addend to the value of $c(w)$,

\begin{align}
& a_i(w', z) \otimes b_i(w'', z) \tag{93}\\
&= \langle p(a_i(w', z)),\, R(a_i(w', z)),\, q(a_i(w', z)) \rangle
   \otimes \langle p(b_i(w'', z)),\, R(b_i(w'', z)),\, q(b_i(w'', z)) \rangle \tag{94}\\
&= \left\langle \begin{array}{l}
   p(a_i(w', z))\, p(b_i(w'', z)), \\
   p(a_i(w', z))\, R(b_i(w'', z)) + p(b_i(w'', z))\, R(a_i(w', z)), \\
   q(a_i(w', z))\, q(b_i(w'', z))
   \end{array} \right\rangle \tag{95}
\end{align}

Consider the second coordinate.

\begin{align}
& p(a_i(w', z))\, R(b_i(w'', z)) + p(b_i(w'', z))\, R(a_i(w', z)) \tag{96}\\
&= p(a_i(w', z)) \sum_{x \in \mathrm{Proofs}(b_i(w'', z))} p(x) \log q(x)
   + p(b_i(w'', z)) \sum_{x' \in \mathrm{Proofs}(a_i(w', z))} p(x') \log q(x') \tag{97}\\
&= \sum_{x' \in \mathrm{Proofs}(a_i(w', z))} \; \sum_{x \in \mathrm{Proofs}(b_i(w'', z))} p(x')\, p(x) \log q(x)
   + \sum_{x \in \mathrm{Proofs}(b_i(w'', z))} \; \sum_{x' \in \mathrm{Proofs}(a_i(w', z))} p(x)\, p(x') \log q(x') \tag{98}\\
&= \sum_{x \in \mathrm{Proofs}(b_i(w'', z))} \; \sum_{x' \in \mathrm{Proofs}(a_i(w', z))} p(x)\, p(x') \log \left( q(x)\, q(x') \right) \tag{99}
\end{align}

Embedding the above in a $\oplus$-summation over $z$ and a $\oplus$-summation over
inference rule instantiations gives a $\oplus$-summation over proofs of $c(w)$,

\[
\sum_{x \in \mathrm{Proofs}(c(w))} p(x) \log q(x) \tag{100}
\]

which is $R(c(w))$ as desired. □
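Denote by $\langle \bar{p}, R, \bar{q} \rangle$ the value for goal in the
generalized entropy semiring as discussed above, i.e.,
$\bar{p} = p(\mathrm{goal})$, $R = R(\mathrm{goal})$, and
$\bar{q} = q(\mathrm{goal})$. If we wish to renormalize $p$ by $\bar{p}$ and $q$
by $\bar{q}$ to give proper distributions over proofs of goal (given axioms and
goal), then

\[
\mathrm{CE}\!\left( \frac{p}{\bar{p}} \,\Big\|\, \frac{q}{\bar{q}} \right)
 = -\sum_{x \in \mathrm{Proofs}(\mathrm{goal})} \frac{p(x)}{\bar{p}} \log \frac{q(x)}{\bar{q}}
 = -\frac{R}{\bar{p}} + \log \bar{q} \tag{101}
\]

Noting that $H(p) = \mathrm{CE}(p \| p)$ and that

\[
\mathrm{KL}\!\left( \frac{p}{\bar{p}} \,\Big\|\, \frac{q}{\bar{q}} \right)
 = \mathrm{CE}\!\left( \frac{p}{\bar{p}} \,\Big\|\, \frac{q}{\bar{q}} \right)
 - \mathrm{CE}\!\left( \frac{p}{\bar{p}} \,\Big\|\, \frac{p}{\bar{p}} \right) \tag{102}
\]

we can solve for the KL divergence of two (possibly renormalized) distributions
$p$ and $q$ using the above results. Alternatively, if the generalized KL
divergence between unnormalized distributions is preferred (O'Sullivan 1998), note
that (in the notation of the above):

\[
\sum_{x \in \mathrm{Proofs}(\mathrm{goal})} \left( p(x) \log \frac{p(x)}{q(x)} - p(x) + q(x) \right)
 = \sum_{x \in \mathrm{Proofs}(\mathrm{goal})} p(x) \log p(x) \; - R - \bar{p} + \bar{q} \tag{103}
\]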

Cortes et al. describe how to compute KL divergence between two probabilistic
finite-state automata with a single path per string ("unambiguous" automata).
The authors make use of finite-state intersection (discussed above in §5.1). This
suggests an analogous interpretation of the PRODUCT transformation for computing
KL divergence between two weighted logic programs.
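The computation of KL divergence from semiring values can be sketched on a toy example. The code assumes that proofs are enumerated explicitly as lists of axiom names, that every axiom has strictly positive weight under both assignments, and that cross-entropy is taken with the sign convention of Eq. 80; solving the program a second time with $q$ replaced by $p$ supplies the entropy term. All names and weights are hypothetical.

```python
import math
from functools import reduce

def splus(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def stimes(a, b):
    return (a[0] * b[0], a[0] * b[1] + b[0] * a[1], a[2] * b[2])

def goal_triple(proofs, p, q):
    """Value of goal when axiom a is mapped to (p(a), p(a) log q(a), q(a))."""
    def lift(a):
        return (p[a], p[a] * math.log(q[a]), q[a])
    return reduce(splus, (reduce(stimes, map(lift, proof)) for proof in proofs))

def cross_entropy(proofs, p, q):
    """Cross-entropy of the renormalized distributions: -R/p_bar + log q_bar."""
    p_bar, r, q_bar = goal_triple(proofs, p, q)
    return -r / p_bar + math.log(q_bar)

def kl(proofs, p, q):
    """KL divergence as cross-entropy minus entropy (cross-entropy with itself)."""
    return cross_entropy(proofs, p, q) - cross_entropy(proofs, p, p)

def kl_direct(proofs, p, q):
    """Reference value computed from normalized proof scores."""
    ps = [math.prod(p[a] for a in proof) for proof in proofs]
    qs = [math.prod(q[a] for a in proof) for proof in proofs]
    zp, zq = sum(ps), sum(qs)
    return sum(px / zp * math.log((px / zp) / (qx / zq))
               for px, qx in zip(ps, qs))

# Hypothetical program with three proofs over four axioms.
proofs = [["a", "b"], ["a", "c"], ["d"]]
p = {"a": 0.5, "b": 0.4, "c": 0.6, "d": 0.3}
q = {"a": 0.6, "b": 0.5, "c": 0.5, "d": 0.2}
```

The second coordinate of `goal_triple` can also be compared against the sum of p(x) log q(x) over proofs, which is the quantity Theorem 2 says it should equal.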

\begin{align}
\mathrm{goal} &\oplus= \mathrm{path}(P, Q) \otimes \mathrm{final}(Q). \tag{104}\\
\mathrm{path}(\mathrm{null}, Q) &\oplus= \mathrm{initial}(Q). \tag{105}\\
\mathrm{path}(P', Q) &\oplus= \mathrm{path}(P, P') \otimes \mathrm{biarc}(P, P', Q, A, B). \tag{106}
\end{align}

Fig. 32. Weighted finite-state transducer where arriving at a certain state depends
on the last two states. null serves as a place-holder for the non-state prior to
the initial state.

\begin{align}
\mathrm{goal}_{1\bullet 2} &\oplus= \mathrm{path}_{1\bullet 2}(P, Q) \otimes \mathrm{final}_1(Q) \otimes \mathrm{final}_2(Q). \tag{107}\\
\mathrm{path}_{1\bullet 2}(\mathrm{null}, Q) &\oplus= \mathrm{initial}_1(Q) \otimes \mathrm{initial}_2(Q). \tag{108}\\
\mathrm{path}_{1\bullet 2}(P', Q) &\oplus= \mathrm{path}_{1\bullet 2}(P, P') \otimes \mathrm{arc}_1(P', Q, A, B) \otimes \mathrm{biarc}_2(P, P', Q, A, B). \tag{109}
\end{align}

Fig. 33. The PRODUCT program of Figure 16 with Figure 32 with constraints that
match proofs according to states and emissions sequences.

Let $\mathcal{P}$ and $\mathcal{Q}$ be two instances of a weighted logic program,
with possibly different axiom weights. Assume we set the values of the axioms of
$\mathcal{P}$ (ranging over $a$) to be $\langle p(a), 0, 1 \rangle$, and for
$\mathcal{Q}$ we set them to $\langle 1, \log q(a), q(a) \rangle$. If we take a
PRODUCT of $\mathcal{P}$ and $\mathcal{Q}$, using the "natural" pairing, then we
end up with a program that computes
$\langle p(\mathrm{goal}), R(\mathrm{goal}), q(\mathrm{goal}) \rangle$ in the
generalized entropy semiring, where $R(\cdot)$ is specified in Eq. 90. These
quantities can be used to compute KL divergence as specified in Eq. 102. This is a
direct result of Theorem 2.
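To see why these two axiom valuations reproduce the mapping of Eq. 89 (a worked step added for illustration), multiply the value of an axiom $a$ in the first program by the value of its paired axiom in the second, using the $\otimes$ of Eq. 82:

```latex
\begin{align*}
\langle p(a), 0, 1 \rangle \otimes \langle 1, \log q(a), q(a) \rangle
  &= \langle p(a) \cdot 1,\; p(a) \log q(a) + 1 \cdot 0,\; 1 \cdot q(a) \rangle \\
  &= \langle p(a),\; p(a) \log q(a),\; q(a) \rangle .
\end{align*}
```

Each pair of matched axioms in the PRODUCT program therefore contributes exactly the value that Theorem 2 assumes for a single axiom.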



7.3 KL Divergence and Projections 

We can use PRODUCT to calculate KL divergence between proof distributions even
when $\mathcal{P}$ and $\mathcal{Q}$ are not two instances of the same program. We
consider cases where the proofs of $\mathcal{P}$ and the proofs of $\mathcal{Q}$
have a shared semantics, that is, each proof of either $\mathcal{P}$ or
$\mathcal{Q}$ maps to an event in some "interpretation space."

As an example, consider the WLP in Figure 16, describing a weighted finite-state
transducer, together with a more general formulation in which each state depends
on the previous $N$ states visited, rather than just the single most recent state.
This modification is reflected in Figure 32 for $N = 2$. The axiom
$\mathrm{biarc}(P', Q, P, A, B)$ is to be interpreted as: "if the last two states
were $Q$ and $P'$, transfer to state $P$ while reading symbol $A$ and emitting the
symbol $B$." Since the two programs have different axioms, the spaces of their
respective proofs are different. However, both programs assign identical semantics
to a proof; a proof (in either program) corresponds to a sequence of states that
the transducers go through together with the reading of a symbol and the emission
of another symbol.



Running PRODUCT on the WFST in Figure 16 (we call it $\mathcal{P}$) and the WFST
in Figure 32 (we call it $\mathcal{Q}$) with a particular pairing and constraints
(such that the paths are identical) yields the program in Figure 33. If we let the
axioms $a$ in $\mathcal{P}$ have the values $\langle p(a), 0, 1 \rangle$ and the
axioms $a$ in $\mathcal{Q}$ the values $\langle 1, \log q(a), q(a) \rangle$, then
the resulting PRODUCT program in Figure 33, as implied by Theorem 2, calculates
the KL divergence between two distributions over the set of state paths: one which
is defined using a finite-state transducer with $N = 1$ and the other with
$N = 2$.
We now generalize this idea for two different programs $\mathcal{P}$ and
$\mathcal{Q}$. We assume that PRODUCT is applied in such a way that axioms from
$\mathcal{P}$ are paired only with axioms from $\mathcal{Q}$, and vice versa.
Further, each proof in the PRODUCT program must decompose into exactly one proof in
$\mathcal{P}$ and one proof in $\mathcal{Q}$. For a proof in the PRODUCT program,
$y$, we define $\pi_{\mathcal{P}}(y)$ ($\pi_{\mathcal{Q}}(y)$) to be the projection
of $y$ to a proof in $\mathcal{P}$ ($\mathcal{Q}$). The "projection" of a proof is
a separation of the proof which uses coupled theorems and axioms into theorems and
axioms of only one of the programs. For example, projecting a proof $y$ in the
product program in Figure 33 yields two proofs: $\pi_{\mathcal{P}}(y)$ describes a
sequence of transitions through the transducer with $N = 1$ and
$\pi_{\mathcal{Q}}(y)$ describes a sequence of transitions through the transducer
with $N = 2$; yet both proofs correspond to the same sequence of states.

In the generalized entropy semiring, we set the values of the axioms of
$\mathcal{P}$ to be $\langle p(a), 0, 1 \rangle$, and for $\mathcal{Q}$ we set them
to $\langle 1, \log q(a), q(a) \rangle$. The PRODUCT program computes
$\langle p(\mathrm{goal}_1), R(\mathrm{goal}_{1\bullet 2}), q(\mathrm{goal}_2) \rangle$.
This time, the summation in $R(\mathrm{goal}_{1\bullet 2})$ is over proofs which
are implicitly paired:

\[
R(\mathrm{goal}_{1\bullet 2}) = \sum_{y \in \mathrm{Proofs}(\mathrm{goal}_{1\bullet 2})} p(\pi_{\mathcal{P}}(y)) \log q(\pi_{\mathcal{Q}}(y)) \tag{110}
\]

The quantities $p(\mathrm{goal}_1)$, $R(\mathrm{goal}_{1\bullet 2})$, and
$q(\mathrm{goal}_2)$ can be used as before to compute the KL divergence between
the distributions over the shared "interpretation space" of the proofs in the two
programs. This technique is only correct when interpretations are in a one-to-one
correspondence with the proofs in $\mathcal{P}$ and with the proofs in
$\mathcal{Q}$, and PRODUCT is applied so that equivalently-interpretable proofs in
the two programs are paired.

We note that in the general case, the problem of computing KL divergence between
two arbitrary distributions is hard. For example, with Markov networks, there
are restrictions, which resemble the restrictions we pose, of clique decomposition
(Koller and Friedman 2009).



8 Conclusion 

We have described a framework for dynamic programming algorithms whose solutions
correspond to proof values in two constrained weighted logic programs. Our
framework includes a program transformation, PRODUCT, which combines the two
weighted logic programs that compute over two structures into a single weighted
logic program for a joint proof. Appropriate constraints, encoded intuitively as
variable unification or side conditions in the weighted logic program, are then
added manually. The framework naturally captures and permits generalization of
many existing algorithms. We have shown how variations on the program
transformation make it possible to include a larger set of algorithms as the
result of the program transformation. We have concluded by showing how the program
transformation can be used to interpret the computation of Kullback-Leibler
divergence for two weighted logic programs which are defined over an identical
interpretation space.

Footnote: Note that these constraints are satisfied in the case of two identical
programs with the "natural" pairing, as in §7.2.